The present invention relates to the field of computer systems. More specifically, the present invention relates to
computer systems for visualizing biological sequences, as well as for evaluating and comparing biological sequences.

Devices and computer systems for forming and using arrays of materials on a substrate are known. For example,
PCT applications WO92/10588 and 95/11995, incorporated herein by reference for all purposes, describe techniques
for sequencing or sequence checking nucleic acids and other materials. Arrays for performing these operations may be
formed according to the methods of, for example, the pioneering techniques disclosed in U.S. Patent Nos.
5,445,934 and 5,384,261, and U.S. Patent Application No. 08/249,188, each incorporated herein by reference for all
purposes.

According to one aspect of the techniques described therein, an array of nucleic acid probes is fabricated at known
locations on a chip. A labeled nucleic acid is then brought into contact with the chip, and a scanner generates
an image file (also called a cell file) indicating the locations where the labeled nucleic acids bound to the chip. Based
upon the image file and identities of the probes at specific locations, it becomes possible to extract information such as
the monomer sequence of DNA or RNA. Such systems have been used to form, for example, arrays of DNA that may [...]

Improved computer systems and methods are needed to evaluate, analyze, and process the vast amount of information
now used and made available by these pioneering technologies.

SUMMARY OF THE INVENTION 

An improved computer-aided system for visualizing and determining the sequence of nucleic acids is disclosed.
The computer system provides, among other things, improved methods of analyzing fluorescent image files of a chip
containing hybridized nucleic acid probes in order to call bases in sample nucleic acid sequences.

According to one aspect of the invention, a computer system is used to identify an unknown base in a sample nucleic
acid sequence by the steps of:

- inputting multiple probe intensities, each of the probe intensities being associated with a nucleic acid probe;

- the computer system comparing the multiple probe intensities where each of the probe intensities is substantially
proportional to a nucleic acid probe hybridizing with at least one nucleic acid sequence; and

- calling the unknown base according to the results of the comparison of the multiple probe intensities.

According to one specific aspect of the invention, a higher probe intensity is compared to a lower probe intensity to
call the unknown base. According to another specific aspect of the invention, probe intensities of a sample sequence
are compared to probe intensities of a reference sequence. According to yet another specific aspect of the invention,
probe intensities of a sample sequence are compared to statistics about probe intensities of a reference sequence from [...]

According to another aspect of the invention, a method is disclosed of processing reference and sample nucleic
acid sequences to reduce the variations between the experiments by the steps of:

- providing a plurality of nucleic acid probes;

- labeling the reference nucleic acid sequence with a first marker;

- labeling the sample nucleic acid sequence with a second marker; and

- hybridizing the labeled reference and sample nucleic acid sequences at the same time.

According to another aspect of the invention, a computer system is used to identify mutations in a sample nucleic
acid sequence by the steps of:

- inputting a first set of probe intensities, each of the probe intensities in said first set being associated with a nucleic
acid probe and substantially proportional to the associated nucleic acid probe hybridizing with a reference nucleic
acid sequence;

- inputting a second set of probe intensities, each of the probe intensities in said second set being associated with a
nucleic acid probe and substantially proportional to the associated nucleic acid probe hybridizing with said sample
sequence;

- the computer system comparing probe intensities in the first set to probe intensities in the second set to select
hybridization regions where the probe intensities in the first and second sets differ; and

- identifying mutations according to characteristics of the selected regions.

According to yet another aspect of the invention, a computer system is used for comparative analysis and
visualization of multiple sequences by the steps of:

- displaying at least one reference sequence in a first area on a display device; and

- displaying at least one sample sequence in a second area on said display device;

whereby a user is capable of visually comparing the multiple sequences.

A further understanding of the nature and advantages of the inventions herein may be realized by reference to the
remaining portions of the specification and the attached drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

Fig. 1 illustrates an example of a computer system used to execute the software of the present invention;

Fig. 2 shows a system block diagram of a typical computer system used to execute the software of the present
invention;

Fig. 3 illustrates an overall system for forming and analyzing arrays of biological materials such as DNA or RNA;

Fig. 4 is an illustration of the software for the overall system;

Fig. 5 illustrates the global layout of a chip formed in the overall system;

Fig. 6 illustrates conceptually the binding of probes on chips;

Fig. 7 illustrates probes arranged in lanes on a chip;

Fig. 8 illustrates a hybridization pattern of a target on a chip with a reference sequence as in Fig. 7;

Fig. 9 illustrates the high level flow of the intensity ratio method;

Fig. 10A illustrates the high level flow of one implementation of the reference method and Fig. 10B shows an analysis
table for use with the reference method;

Fig. 11A illustrates the high level flow of another implementation of the reference method; Fig. 11B shows a data
table for use with the reference method; Fig. 11C shows a graph of the normalized sample base intensities minus
the normalized reference base intensities; and Fig. 11D shows other graphs of data in the data table;

Fig. 12 illustrates the high level flow of the statistical method;

Fig. 13 illustrates the pooling processing of a reference and sample nucleic acid sequence;

Figs. 14A and 14C show graphs of scaled fluorescent intensities of wild-type probes hybridizing with sample and
reference sequences and Fig. 14B shows a hypothetical graph of fluorescent intensities of wild-type probes hybridizing
with two sample sequences and a reference sequence;

Fig. 15 illustrates the high level flow of an embodiment that uses the hybridization data from more than one base
position to identify mutations in a sample sequence;

Fig. 16 illustrates the main screen and the associated pull down menus for comparative analysis and visualization
of multiple experiments;

Fig. 17 illustrates an intensity graph window for a selected base;

Fig. 18 illustrates multiple intensity graph windows for selected bases;

Fig. 19 illustrates the intensity ratio method correctly calling a mutation in solutions with varying concentrations;

Fig. 20 illustrates the reference method correctly calling a mutant base where the intensity ratio method incorrectly
called the mutant base; and

Fig. 21 illustrates the output of the ViewSeq™ program with four pretreatment samples and four posttreatment
samples.

In the description that follows, the present invention will be described in reference to a Sun Workstation in a UNIX
environment. The present invention, however, is not limited to any particular hardware or operating system environment.
Instead, those skilled in the art will find that the systems and methods of the present invention may be advantageously
applied to a variety of systems, including IBM personal computers running MS-DOS or Microsoft Windows. Therefore,
the following description of specific systems is for purposes of illustration and not limitation.

Fig. 1 illustrates an example of a computer system used to execute the software of the present invention. Fig. 1
shows a computer system 1 which includes a monitor 3, screen 5, cabinet 7, keyboard 9, and mouse 11. Mouse 11 may
have one or more buttons such as mouse buttons 13. Cabinet 7 houses a floppy disk drive 14 and a hard drive (not
shown) that may be utilized to store and retrieve software programs incorporating the present invention. Although a
floppy disk 15 is shown as the removable media, other removable tangible media including CD-ROM, flash memory and
tape may be utilized. Cabinet 7 also houses familiar computer components (not shown) such as a processor, memory,
and the like.

Fig. 2 shows a system block diagram of computer system 1 used to execute the software of the present invention.
As in Fig. 1, computer system 1 includes monitor 3 and keyboard 9. Computer system 1 further includes subsystems
such as a central processor 52, system memory 54, I/O controller 56, display adapter 58, serial port 62, disk 64, network
interface 66, and speaker 68. Disk 64 is representative of an internal hard drive, floppy drive, CD-ROM, flash memory,
tape, or any other storage medium. Other computer systems suitable for use with the present invention may include
additional or fewer subsystems. For example, another computer system could include more than one processor 52 (i.e.,
a multi-processor system) or memory cache.

Arrows such as 70 represent the system bus architecture of computer system 1. However, these arrows are
illustrative of any interconnection scheme serving to link the subsystems. For example, speaker 68 could be connected to
the other subsystems through a port or have an internal direct connection to central processor 52. Computer system 1
shown in Fig. 2 is but an example of a computer system suitable for use with the present invention. Other configurations
of subsystems suitable for use with the present invention will be readily apparent to one of ordinary skill in the art.

The VLSIPS™ technology provides methods of making very large arrays of oligonucleotide probes on very small
chips. See U.S. Patent No. 5,143,854 and PCT patent publication Nos. WO 90/15070 and 92/10092, each of which is
incorporated by reference for all purposes. The oligonucleotide probes on the DNA probe array are used to detect
complementary nucleic acid sequences in a sample nucleic acid of interest (the "target" nucleic acid).

The present invention provides methods of analyzing hybridization intensity files for a chip containing hybridized
nucleic acid probes. In a representative embodiment, the files represent fluorescence data from a biological array, but
the files may also represent other data such as radioactive intensity data or large molecule detection data. Therefore,
the present invention is not limited to analyzing fluorescent measurements of hybridizations but may be readily utilized
to analyze other measurements of hybridization.

For purposes of illustration, the present invention is described as being part of a computer system that designs a
chip mask, synthesizes the probes on the chip, labels the nucleic acids, and scans the hybridized nucleic acid probes.
Such a system is fully described in U.S. Patent Application No. 08/249,188 which has been incorporated by reference
for all purposes. However, the present invention may be used separately from the overall system for analyzing data
generated by such systems.

Fig. 3 illustrates a computerized system for forming and analyzing arrays of biological materials such as RNA or
DNA. A computer 100 is used to design arrays of biological polymers such as RNA or DNA. The computer 100 may be,
for example, an appropriately programmed Sun Workstation or personal computer or workstation, such as an IBM PC
equivalent, including appropriate memory and a CPU as shown in Figs. 1 and 2. The computer system 100 obtains
inputs from a user regarding characteristics of a gene of interest, and other inputs regarding the desired features of the
array. Optionally, the computer system may obtain information regarding a specific genetic sequence of interest from an
external or internal database 102 such as GenBank. The output of the computer system 100 is a set of chip design
computer files 104 in the form of, for example, a switch matrix, as described in PCT application WO 92/10092, and other
associated computer files.

The chip design files are provided to a system 106 that designs the lithographic masks used in the fabrication of
arrays of molecules such as DNA. The system or process 106 may include the hardware necessary to manufacture
masks 110 and also the computer hardware and software 108 necessary to lay the mask patterns out on the
mask in an efficient manner. As with the other features in Fig. 3, such equipment may or may not be located at the same
physical site, but is shown together for ease of illustration in Fig. 3. The system 106 generates masks 110 or other
synthesis patterns such as chrome-on-glass masks for use in the fabrication of polymer arrays.

The masks 110, as well as selected information relating to the design of the chips from system 100, are used in a
synthesis system 112. Synthesis system 112 includes the necessary hardware and software used to fabricate arrays of
polymers on a substrate or chip 114. For example, synthesizer 112 includes a light source 116 and a chemical flow cell
118 on which the substrate or chip 114 is placed. Mask 110 is placed between the light source and the substrate/chip,
and the two are translated relative to each other at appropriate times for deprotection of selected regions of the chip.
Selected chemical reagents are directed through flow cell 118 for coupling to deprotected regions, as well as for washing
and other operations. All operations are preferably directed by an appropriately programmed computer 119, which may
or may not be the same computer as the computer(s) used in mask design and mask making.

The substrates fabricated by synthesis system 112 are optionally diced into smaller chips and exposed to marked
receptors. The receptors may or may not be complementary to one or more of the molecules on the substrate. The
receptors are marked with a label such as a fluorescein label (indicated by an asterisk in Fig. 3) and placed in scanning
system 120. Scanning system 120 again operates under the direction of an appropriately programmed digital computer
122, which also may or may not be the same computer as the computers used in synthesis, mask making, and mask
design. The scanner 120 includes a detection device 124 such as a confocal microscope or CCD (charge-coupled
device) that is used to detect the locations where labeled receptor (*) has bound to the substrate. The output of scanner
120 is an image file(s) 124 indicating, in the case of fluorescein labeled receptor, the fluorescence intensity (photon
counts or other related measurements, such as voltage) as a function of position on the substrate. Since higher photon
counts will be observed where the labeled receptor has bound more strongly to the array of polymers, and since the
monomer sequence of the polymers on the substrate is known as a function of position, it becomes possible to determine
the sequence(s) of polymer(s) on the substrate that are complementary to the receptor.

The image file 124 is provided as input to an analysis system 126 that incorporates the visualization and analysis
methods of the present invention. Again, the analysis system may be any one of a wide variety of computer system(s),
but in a preferred embodiment the analysis system is based on a Sun Workstation or equivalent. The present invention
provides various methods of analyzing the chip design files and the image files, providing appropriate output 128. The
present invention may further be used to identify specific mutations in a receptor such as DNA or RNA.

Fig. 4 provides a simplified illustration of the overall software system used in the operation of one embodiment of
the invention. As shown in Fig. 4, in some cases (such as sequence checking systems) the system first identifies the
genetic sequence(s) or targets that would be of interest in a particular analysis at step 202. The sequences of interest
may, for example, be normal or mutant portions of a gene, genes that identify heredity, or provide forensic information,
or be all possible n-mers (where n represents the length of the nucleic acid). Sequence selection may be provided via
manual input of text files or may be from external sources such as GenBank. At step 204 the system evaluates the gene
to determine or assist the user in determining which probes would be desirable on the chip, and provides an appropriate
"layout" on the chip for the probes. The chip usually includes probes that are complementary to a reference nucleic acid
sequence which has a known sequence. A wild-type probe is a probe that will ideally hybridize with the reference
sequence and thus a wild-type gene (also called the chip wild-type) would ideally hybridize with wild-type probes on the
chip. The target sequence is substantially similar to the reference sequence except for the presence of mutations,
insertions, deletions, and the like. The layout implements desired characteristics such as arrangement on the chip that permits
"reading" of genetic sequence and/or minimization of edge effects, ease of synthesis, and the like.

Fig. 5 illustrates the global layout of a chip in a particular embodiment used for sequence checking applications.
Chip 114 is composed of multiple units where each unit may contain different tilings for the chip wild-type sequence.
Unit 1 is shown in greater detail and shows that each unit is composed of multiple cells which are areas on the chip that
may contain probes. Conceptually, each unit is composed of multiple sets of related cells. As used herein, the term cell
refers to a region on a substrate that contains many copies of a molecule or molecules of interest. Each unit is composed
of multiple cells that may be placed in rows (or "lanes") and columns. In one embodiment, a set of five related cells
includes the following: a wild-type cell 220, "mutation" cells 222, and a "blank" cell 224. Cell 220 contains a wild-type
probe that is the complement of a portion of the wild-type sequence. Cells 222 contain "mutation" probes for the
wild-type sequence. For example, if the wild-type probe is 3'-ACGT, the probes 3'-ACAT, 3'-ACCT, 3'-ACGT, and 3'-ACTT
may be the "mutation" probes. Cell 224 is the "blank" cell because it contains no probes (also called the "blank" probe).
As the blank cell contains no probes, labeled receptors should not bind to the chip in this area. Thus, the blank cell
provides an area that can be used to measure the background intensity.

In one embodiment, numerous tiling processes are available including sequence tiling, block tiling, and opt-tiling
as described below. Of course a wide range of layout strategies may be used according to the invention herein, without
departing from the scope of the invention. For example, the probes may be tiled on a substrate in an apparently random
fashion where a computer system is utilized to keep track of the probe locations and correlate the data obtained from
the substrate.

Opt-tiling is the process of tiling additional probes for suspected mutations. As a simple example of opt-tiling, suppose
the wild-type target sequence is 5'-ACGTATGCA-3' and it is suspected that a mutant sequence has a possible T base
mutation at the underlined base position (the A at position 5). Suppose further that the chip will be synthesized with a
"4x3" tiling strategy, meaning that probes of four monomers are used and that the monomers in position 3, counting left
to right, of the probe are varied.

In opt-tiling, extra probes are tiled for each suspected mutation. The extra probes are tiled as if the mutation base
is a wild-type base. The following shows the probes that may be generated for this example:



Table 1

Probe Sequences (From 3'-end) 4x3 Opt-Tiling

Wild     TGCA   GCAT   CATA   ATAC   TACG
A sub.   TGAA   GCAT   CAAA   ATAC   TAAG
C sub.   TGCA   GCCT   CACA   ATCC   TACG
G sub.   TGGA   GCGT   CAGA   ATGC   TAGG
T sub.   TGTA   GCTT   CATA   ATTC   TATG

Wild     TGCA   GCAA   CAAA   AAAC   AACG
A sub.   TGAA   GCAA   CAAA   AAAC   AAAG
C sub.   TGCA   GCCA   CACA   AACC   AACG
G sub.   TGGA   GCGA   CAGA   AAGC   AAGG
T sub.   TGTA   GCTA   CATA   AATC   AATG


In the first "chip" above, the top row of the probes (along with one probe below each of the four wild-type probes) should
bind to the target DNA sequence. However, if the target sequence has a T base mutation as suspected, the labeled
mutant sequence will not bind that strongly to the probes in the columns around column 3. For example, the mutant
receptor that could bind with the probes in column 2 is 5'-CGTT, which may not bind that strongly to any of the probes
in column 2 because there are T bases at the ends of the receptor and probes (i.e., not complementary). This often
results in a relatively dark scanned area around a mutation.

Opt-tiling generates the second "chip" above to handle the suspected mutation as a wild-type base. Thus, the mutant
receptor 5'-CGTT should bind strongly to the wild-type probe of column 2 (along with one probe below) and the
mutation can be further detected.
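
The tiling in Table 1 can be reproduced with a short sketch. This is an illustration only, not the layout software of the invention: the function names `complement`, `tile`, and `opt_tile` are hypothetical, and the sketch emits a probe for every four-base window of the target, whereas Table 1 lists only the first five columns.

```python
# Hypothetical sketch of 4x3 tiling and opt-tiling; names are illustrative only.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def complement(seq):
    """Base-by-base complement; a 5'->3' target yields probes read from the 3' end."""
    return "".join(COMPLEMENT[base] for base in seq)

def tile(target, probe_len=4, sub_pos=3):
    """Return the wild-type row and the A/C/G/T substitution rows for a target.

    sub_pos is the 1-based probe position that is varied, counting left to right.
    """
    comp = complement(target)
    wild = [comp[i:i + probe_len] for i in range(len(comp) - probe_len + 1)]
    rows = {"Wild": wild}
    for base in "ACGT":
        rows[base + " sub."] = [p[:sub_pos - 1] + base + p[sub_pos:] for p in wild]
    return rows

def opt_tile(target, mut_index, mut_base, probe_len=4, sub_pos=3):
    """Tile the target, then tile again treating the suspected mutation as wild-type.

    mut_index is the 0-based position of the suspected mutation in the target.
    """
    mutant = target[:mut_index] + mut_base + target[mut_index + 1:]
    return tile(target, probe_len, sub_pos), tile(mutant, probe_len, sub_pos)

first_chip, second_chip = opt_tile("ACGTATGCA", mut_index=4, mut_base="T")
for label, probes in first_chip.items():
    print(label, " ".join(probes[:5]))
```

Tiling the mutant sequence as if it were the reference is what produces the second "chip"; no separate probe-design rule is needed in this sketch.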

Again referring to Fig. 4, at step 206 the masks for the synthesis are designed. At step 208 the software utilizes the
mask design and layout information to make the DNA or other polymer chips. This software 208 will control relative
translation of a substrate and the mask, the flow of desired reagents through a flow cell, the synthesis temperature of
the flow cell, and other parameters. At step 210, another piece of software is used in scanning a chip thus synthesized
and exposed to a labeled receptor. The software controls the scanning of the chip, and stores the data thus obtained in
a file that may later be utilized to extract sequence information.

At step 212 a computer system according to the present invention utilizes the layout information and the fluorescence
information to evaluate the hybridized nucleic acid probes on the chip. Among the important pieces of information
obtained from probe arrays are the identification of mutant receptors and determination of genetic sequence of a
particular receptor.

Fig. 6 illustrates the binding of a particular target DNA to an array of DNA probes 114. As shown in this simple
example, the following probes are formed in the array (only one probe is shown for the wild-type probe):

3'-AGAACGT
   AGACCGT
   AGAGCGT
   AGATCGT

As shown, the set of probes differ by only one base so the probes are designed to determine the identity of the base at
that position in the nucleic acid sequence.

When a fluorescein-labeled (or otherwise marked) target with the sequence 5'-TCTTGCA is exposed to the array,
it is complementary only to the probe 3'-AGAACGT, and fluorescein will be primarily found on the surface of the chip
where 3'-AGAACGT is located. Thus, for each set of probes that differ by only one base, the image file will contain four
fluorescence intensities, one for each probe. Each fluorescence intensity can therefore be associated with the base of
each probe that is different from the other probes. Additionally, the image file will contain a "blank" cell which can be
used as the fluorescence intensity of the background. By analyzing the five fluorescence intensities associated with a
specific base location, it becomes possible to extract sequence information from such arrays using the methods of the
invention disclosed herein.
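
The complementarity in the Fig. 6 example can be checked with a small sketch. The helper names `perfect_match` and `varied_positions` are hypothetical and only illustrate the pairing rule; real hybridization is not all-or-nothing, so this is a simplification.

```python
# Hypothetical sketch: which probe in a one-base-varied set pairs with the target?
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def complement(seq):
    """Base-by-base complement of a sequence."""
    return "".join(COMPLEMENT[base] for base in seq)

def perfect_match(target_5to3, probes_3to5):
    """Return the probe (written from its 3' end) that fully complements the target, or None."""
    wanted = complement(target_5to3)
    for probe in probes_3to5:
        if probe == wanted:
            return probe
    return None

def varied_positions(probes):
    """Return the 0-based positions at which the probes in a set differ."""
    return [i for i in range(len(probes[0])) if len({p[i] for p in probes}) > 1]

probes = ["AGAACGT", "AGACCGT", "AGAGCGT", "AGATCGT"]
match = perfect_match("TCTTGCA", probes)
position = varied_positions(probes)[0]
print(match, position, COMPLEMENT[match[position]])
```

The base read out for the target is the complement of the varied base in the matching probe, which is why the brightest cell identifies the base at that position.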

Fig. 7 illustrates probes arranged in lanes on a chip. A reference sequence is shown with five interrogation positions
marked with number subscripts. An interrogation position is a base position in the reference sequence where the target
sequence may contain a mutation or otherwise differ from the reference sequence. The chip may contain five probe cells
that correspond to each interrogation position. Each probe cell contains a set of probes that have a common base at
the interrogation position. For example, at the first interrogation position, I1, the reference sequence has a base T. The
wild-type probe for this interrogation position is 3'-TGAC where the base A in the probe is complementary to the base
at the interrogation position in the reference sequence.

Similarly, there are four "mutant" probe cells for the first interrogation position, I1. The four mutant probes are
3'-TGAC, 3'-TGCC, 3'-TGGC, and 3'-TGTC. The four mutant probes differ from one another by a single base at the
interrogation position. As shown, the wild-type and mutant probes are arranged in lanes on the chip. One of the mutant
probes (in this case 3'-TGAC) is identical to the wild-type probe and therefore does not evidence a mutation. However, the
redundancy gives a visual indication of mutations as will be seen in Fig. 8.

Still referring to Fig. 7, the chip contains wild-type and mutant probes for each of the other interrogation positions
I2-I5. In each case, the wild-type probe is equivalent to one of the mutant probes.

Fig. 8 illustrates a hybridization pattern of a target on a chip with a reference sequence as in Fig. 7. The reference
sequence is shown along the top of the chip for comparison. The chip includes a WT-lane (wild-type), an A-lane, a
C-lane, a G-lane, and a T-lane (or U). Each lane is a row of cells containing probes. The cells in the WT-lane contain probes
that are complementary to the reference sequence. The cells in the A-, C-, G-, and T-lanes contain probes that are
complementary to the reference sequence except that the named base is at the interrogation position.

In one embodiment, the hybridization of probes in a cell is determined by the fluorescent intensity (e.g., photon
counts) of the cell resulting from the binding of marked target sequences. The fluorescent intensity may vary greatly
among cells. For simplicity, Fig. 8 shows a high degree of hybridization by a cell containing a darkened area. The
WT-lane allows a simple visual indication that there is a mutation at interrogation position I4 because the wild-type cell is not
dark at that position. The cell in the C-lane is darkened, which indicates that the mutation is from T->G (mutant probe
cells are complementary so the C-cell indicates a G mutation).

In practice, the fluorescent intensities of cells near an interrogation position having a mutation are relatively dark,
creating "dark regions" around a mutation. The lower fluorescent intensities result because the cells at interrogation
positions near a mutation do not contain probes that are perfectly complementary to the target sequence; thus, the
hybridization of these probes with the target sequence is lower. For example, the relative intensity of the cells at
interrogation positions I3 and I5 may be relatively low because none of the probes therein are complementary to the target
sequence.

For ease of reference, one may call bases by assigning the bases the following codes:

Code   Group               Meaning
A      A                   Adenine
C      C                   Cytosine
G      G                   Guanine
T      T(U)                Thymine (Uracil)
M      A or C              aMino
R      A or G              puRine
W      A or T(U)           Weak interaction (2 H bonds)
Y      C or T(U)           pYrimidine
S      C or G              Strong interaction (3 H bonds)
K      G or T(U)           Keto
V      A, C or G           not T(U)
H      A, C or T(U)        not G
D      A, G or T(U)        not C
B      C, G or T(U)        not A
N      A, C, G, or T(U)    Insufficient intensity to call
X      A, C, G, or T(U)    Insufficient discrimination to call


Most of the codes conform to the IUPAC standard. However, code N has been redefined and code X has been added.
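
As a hedged illustration, the code table can be held as a lookup from a set of candidate bases to its one-letter code. The names `AMBIGUITY_CODES` and `code_for` are hypothetical. Because N and X cover the same four bases and differ only in the reason for the no-call, they are kept as separate constants here rather than as entries keyed by base set.

```python
# Hypothetical sketch of the base-call code table as a lookup; T stands for T(U).
AMBIGUITY_CODES = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AC"): "M", frozenset("AG"): "R", frozenset("AT"): "W",
    frozenset("CT"): "Y", frozenset("CG"): "S", frozenset("GT"): "K",
    frozenset("ACG"): "V", frozenset("ACT"): "H",
    frozenset("AGT"): "D", frozenset("CGT"): "B",
}

# Redefined and added codes from the table: same base group, different reasons.
INSUFFICIENT_INTENSITY = "N"
INSUFFICIENT_DISCRIMINATION = "X"

def code_for(bases):
    """Return the one-letter code for a group of one to three candidate bases."""
    return AMBIGUITY_CODES[frozenset(bases)]

print(code_for("AG"), code_for({"C", "T"}), code_for("ACG"))
```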

II. Intensity Ratio Method

The intensity ratio method is a method of calling bases in a sample nucleic acid sequence. The intensity ratio method
is most accurate when there is good discrimination between the fluorescence intensities of hybrid matches and hybrid
mismatches. If there is insufficient discrimination, the intensity ratio method assigns a corresponding ambiguity code to
the unknown base.

For simplicity, the intensity ratio method will be described as being used to identify one unknown base in a sample 
nucleic acid sequence. In practice, the method is used to identify many or all the bases in a nucleic acid sequence. 

The unknown base will be identified by evaluation of up to four mutation probes and a "blank" cell, which is a location
where a labeled receptor should not bind to the chip since no probe is present. For example, suppose a DNA sequence
of interest or target sequence contains the sequence 5'-AGAACCTGC-3' with a possible mutation at the underlined base
position (the C at position 5). Suppose that 5-mer probes are to be synthesized for the target sequence. A representative
wild-type probe of 3'-TTGGA is complementary to the region of the sequence around the possible mutation. The
"mutation" probes will be the same as the wild-type probe except for a different base at the third position as follows:
3'-TTAGA, 3'-TTCGA, 3'-TTGGA, and 3'-TTTGA.

If the fluorescently marked sample sequence is exposed to the above four mutation probes, the intensity should be
highest for the probe that binds most strongly to the sample sequence. Therefore, if the probe 3'-TTTGA shows the
highest intensity, the unknown base in the sample will generally be called an A mutation because the probes are
complementary to the sample sequence.

The mutation probes are identical to the wild-type probes except that they each contain one of the four A, C, G, or
T "mutations" for the unknown base. Although one of the "mutation" probes will be identical to the wild-type probe, such
redundant probes are intentionally synthesized for quality control and design consistency.

The identity of the unknown base is preferably determined by evaluating the relative fluorescence intensities of up
to four of the mutation probes, and the "blank" cell. Because each mutation probe is identifiable by the mutation base,
a mutation probe's intensity will be referred to as the "base intensity" of the mutation base.

As a simple example of the intensity ratio method, suppose a gene of interest (target) is an HIV protease gene with
the sequence 5'-ATGTGGACAGTTGTA-3' (SEQ ID NO:1). Suppose further that a sample sequence is suspected to
have the same sequence as the target sequence except for a mutation of base C to base T at the underlined base
position (the C at position 8). Although hundreds of probes may be synthesized on the chip, the complementary mutation
probes synthesized to detect a mutation in the sample sequence at the suspected mutation position may be as follows:

3'-TATC
3'-TCTC
3'-TGTC (wild-type)
3'-TTTC

The mutation probe 3'-TGTC is also the wild-type probe as it should bind most strongly with the target sequence.
After the sample sequence is labeled, hybridized on the chip, and scanned, suppose the following fluorescence
intensities were obtained:

3'-TATC -> 45
3'-TCTC -> 8
3'-TGTC -> 32
3'-TTTC -> 12

where the intensity is measured by the photon count detected by the scanner. The "blank" cell had a fluorescence
intensity of 2. The photon counts in the examples herein are representative (not actual data) and provided for illustration
purposes. In practice, the actual photon counts will vary greatly depending on the experiment parameters and the scanner
utilized.

Although each fluorescence intensity is from a probe, the probes may be characterized by their unique mutation
base so the bases may be said to have the following intensities:
A -> 45
C -> 8
G -> 32
T -> 12

Thus, base A will be described as having an intensity of 45, which corresponds to the intensity of the mutation probe
with the mutation base A.

Initially, each mutation base intensity is reduced by the background or "blank" cell intensity. This is done as follows:

A -> 45 - 2 = 43
C -> 8 - 2 = 6
G -> 32 - 2 = 30
T -> 12 - 2 = 10

Then, the base intensities are sorted in descending order of intensity. The above bases would be sorted as follows:
A -> 43
G -> 30
T -> 10
C -> 6

Next, the highest intensity base is compared to the second highest intensity base. Thus, the ratio of the intensity of base
A to the intensity of base G is calculated as follows: A:G = 43 / 30 = 1.4. The ratio A:G is then compared to a predetermined
ratio cutoff, which is a number that specifies the ratio required to identify the unknown base. For example, if the ratio
cutoff is 1.2, the ratio A:G is greater than the ratio cutoff (1.4 > 1.2) and the unknown base is called by the mutation
probe containing the mutation A. As probes are complementary to the sample sequence, the sample sequence is called
as having a mutation T, resulting in a called sample sequence of 5'-ATGTGGATAGTTGTA-3' (SEQ ID NO:2).

As another example, suppose everything else is the same as in the previous example except that the sorted
background adjusted intensities were as follows:
C -> 42
A -> 40
G -> 10
T -> 8

The ratio of the highest intensity base to the second highest intensity base (C:A) is 1.05. Because this ratio is not greater
than the ratio cutoff of 1.2, the unknown base will be called as being ambiguously one of two or more bases as follows.
The second highest intensity base is then compared to the third highest base. The ratio of A:G is 4. The ratio of A:G
is then compared to the ratio cutoff of 1.2. As the ratio A:G is greater than the ratio cutoff (4 > 1.2), the unknown base
is called by the mutation probes containing the mutations C or A. As probes are complementary to the sample sequence,
the sample sequence is called as having either a mutation G or T, resulting in a sample sequence of
5'-ATGTGGAKAGTTGTA-3' (SEQ ID NO:3) where K is the IUPAC code for G or T(U).

The ratio cutoff in the previous examples was equal to 1.2. However, the ratio cutoff will generally need to be adjusted
to produce optimal results for the specific chip design and wild-type target. Also, although the ratio cutoff used has been
the same for each ratio comparison, the ratio cutoff may vary depending on whether the ratio comparisons involve the
highest, second highest, third highest, etc. intensity base.

Fig. 9 illustrates the high level flow of the intensity ratio method. At step 302 the four base intensities are adjusted
by subtracting the background or "blank" cell intensity from each base intensity. Preferably, if a base intensity is then
less than or equal to zero, the base intensity is set equal to a small positive number to prevent division by zero or negative
numbers in future calculations.

At step 304 the base intensities are sorted by intensity. Each base is then associated with a number from 1 to 4.
The base with the highest intensity is 1, second highest 2, third highest 3, and fourth highest 4. Thus, the intensity of
base 1 ≥ base 2 ≥ base 3 ≥ base 4.

At step 306 the highest intensity base (base 1) is checked to see if it has sufficient intensity to call the unknown
base. The intensity is checked by determining if the intensity of base 1 is greater than a predetermined background
difference cutoff. The background difference cutoff is a number that specifies how far a base intensity must be above
the background intensity in order to correctly call the unknown base. Thus, the background adjusted base intensity must
be greater than the background difference cutoff or the unknown base is not callable.

If the intensity of base 1 is not greater than the background difference cutoff, the unknown base is assigned the
code N (insufficient intensity) as shown at step 308. Otherwise, the ratio of the intensity of base 1 to base 2 is calculated
as shown at step 310.

At step 312 the ratio of intensity of bases 1:2 is compared to the ratio cutoff. If the ratio 1:2 is greater than the ratio
cutoff, the unknown base is called as the complement of the highest intensity base (base 1) as shown at step 314.
Otherwise, the ratio of the intensity of base 2 to base 3 is calculated as shown at step 316.

At step 318 the ratio of intensity of bases 2:3 is compared to the ratio cutoff. If the ratio 2:3 is greater than the ratio
cutoff, the unknown base is called as being an ambiguity code specifying the complements of the highest or second
highest intensity bases (base 1 or 2) as shown at step 320. Otherwise, the ratio of the intensity of base 3 to base 4 is
calculated as shown at step 322.

At step 324 the ratio of intensity of bases 3:4 is compared to the ratio cutoff. If the ratio 3:4 is greater than the ratio
cutoff, the unknown base is called as being an ambiguity code specifying the complements of the highest, second
highest, or third highest bases (base 1, 2 or 3) as shown at step 326. Otherwise, the unknown base is assigned the
code X (insufficient discrimination) as shown at step 328.
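
The flow of steps 302 through 328 can be sketched in a few lines of Python. This is an illustrative sketch only, not the disclosed implementation: the function name call_base, the floor value used for non-positive intensities, and the default background difference cutoff of 5 are assumptions introduced here. The ratio cutoff of 1.2 and the intensities are taken from the two worked examples above, and the ambiguity codes are the standard IUPAC nucleotide codes.

```python
# Probe (mutation) base -> base called in the sample, since probes are complementary.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# Standard IUPAC nucleotide codes for sets of one to three bases.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
}


def call_base(intensities, blank, ratio_cutoff=1.2, background_cutoff=5.0, floor=1e-6):
    """Call one unknown base from four mutation-probe intensities.

    background_cutoff and floor are assumed values chosen for illustration.
    """
    # Step 302: subtract the "blank" cell; clamp non-positive results to a small positive number.
    adjusted = {}
    for base, value in intensities.items():
        diff = value - blank
        adjusted[base] = diff if diff > 0 else floor

    # Step 304: sort bases by descending adjusted intensity (base 1 .. base 4).
    ranked = sorted(adjusted, key=adjusted.get, reverse=True)

    # Steps 306-308: the top base must clear the background difference cutoff.
    if adjusted[ranked[0]] <= background_cutoff:
        return "N"  # insufficient intensity

    # Steps 310-326: compare ratios 1:2, 2:3, 3:4 against the ratio cutoff in turn.
    for k in range(1, 4):
        if adjusted[ranked[k - 1]] / adjusted[ranked[k]] > ratio_cutoff:
            called = frozenset(COMPLEMENT[b] for b in ranked[:k])
            return IUPAC[called]

    # Step 328: no ratio exceeded the cutoff.
    return "X"  # insufficient discrimination


first = call_base({"A": 45, "C": 8, "G": 32, "T": 12}, blank=2)   # first worked example
second = call_base({"A": 42, "C": 44, "G": 12, "T": 10}, blank=2)  # second worked example
print(first, second)
```

With a blank intensity of 2, the second call reproduces the background adjusted intensities 42, 40, 10 and 8 of the second example. A single ratio cutoff is used for all three comparisons here; as noted above, a separate cutoff per comparison could be substituted.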

The advantage of the intensity ratio method is that it is very accurate when there is good discrimination between
the fluorescence intensities of hybrid matches and hybrid mismatches. However, if the base corresponding to a correct
hybrid gives a lower intensity than a mismatch (e.g., as a result of cross-hybridization), incorrect identification of the
base will result. For this reason, however, the method is useful for comparative assessment of hybridization quality and
as an indicator of sequence-specific problem spots. For example, the intensity ratio method has been used to determine
that ambiguities and miscalls tend to be very different from sequence to sequence, and reflect predominantly the
composition and repetitiveness of the sequence. It has also been used to assess improvements obtained by varying
hybridization conditions, sample preparation, and post-hybridization treatments (e.g., RNase treatment).

III. Reference Method 

The reference method is a method of calling bases in a sample nucleic acid sequence. The reference method
depends very little on discrimination between the fluorescence intensities of hybrid matches and hybrid mismatches,
and therefore is much less sensitive to cross-hybridization. The method compares the probe intensities of a reference
sequence to the probe intensities of a sample sequence. Any significant changes are flagged as possible mutations.
There are two implementations of the reference method disclosed herein.

For simplicity, the reference method will be described as being used to identify one unknown base in a sample
nucleic acid sequence. In practice, the method is used to identify many or all the bases in a nucleic acid sequence.

The unknown base will be called by comparing the probe intensities of a reference sequence to the probe intensities
of a sample sequence. Preferably, the probe intensities of the reference sequence and the sample sequence are from
chips having the same chip wild-type. However, the reference sequence may or may not be exactly the same as the chip
wild-type, as it may have mutations.

The bases at the same position in the reference and sample sequences will each be associated with up to four
mutation probes and a "blank" cell. The unknown base in the sample sequence is called by comparing probe intensities
of the sample sequence to probe intensities of the reference sequence. For example, suppose the chip wild-type contains
the sequence 5'-AGACCTTGC-3' and it is suspected that the sample has a possible mutation at the underlined base
position (the second C), which is the unknown base that will be called by the reference method. The "mutation" probes
for the sample sequence may be as follows: 3'-GAAA, 3'-GCAA, 3'-GGAA, and 3'-GTAA, where 3'-GGAA is the
wild-type probe.

Suppose further that a reference sequence, which differs from the chip wild-type by one base mutation, has the
sequence 5'-AGACATTGC-3' where the mutation base is underlined (the A at the fifth position). The "mutation" probes
for the reference sequence may be as follows: 3'-TGAAA, 3'-TGCAA, 3'-TGGAA, and 3'-TGTAA, where 3'-TGTAA is the
reference wild-type probe since the reference sequence is known. Although generally the sample and reference
sequences were tiled with the same chip wild-type, this is not required, and the tiling methods do not have to be identical
as shown by the use of two probe lengths in the example. Thus, the unknown base will be called by comparing the
"mutation" probes of the sample sequence to the "mutation" probes of the reference sequence. As before, because each
mutation probe is identifiable by the mutation base, the mutation probes' intensities will be referred to as the
"base intensities" of their respective mutation bases.

As a simple example of one implementation of the reference method, suppose a gene of interest (target) has the
sequence 5'-AAAACTGAAAA-3' (SEQ ID NO:4). Suppose a reference sequence has the sequence 5'-AAAACCGAAAA-3'
(SEQ ID NO:5), which differs from the target sequence by the underlined base (the sixth base). The reference sequence
is marked and exposed to probes on a chip with the target sequence being the chip wild-type. Suppose further that a
sample sequence is suspected to have the same sequence as the target sequence except for a mutation at the underlined
base position in 5'-AAAACTGAAAA-3' (SEQ ID NO:4). The sample sequence is also marked and exposed to probes on a
chip with the target sequence being the chip wild-type. After hybridization and scanning, the following probe intensities
(not actual data) were found for the respective complementary probes:

Reference          Sample
3'-TGAC -> 12      3'-GACT -> 11
3'-TGCC -> 9       3'-GCCT -> 30
3'-TGGC -> 80      3'-GGCT -> 60
3'-TGTC -> 15      3'-GTCT -> 6


Although each fluorescence intensity is from a probe, the probes may be identified by their unique mutation base so the
bases may be said to have the following intensities:

Reference     Sample
A -> 12       A -> 11
C -> 9        C -> 30
G -> 80       G -> 60
T -> 15       T -> 6

Thus, base A of the reference sequence will be described as having an intensity of 12, which corresponds to the intensity
of the mutation probe with the mutation base A. The reference method will now be described as calling the unknown
base in the sample sequence by using these intensities.

Fig. 10A illustrates the high level flow of one implementation of the reference method. For illustration purposes, the
reference method is described as filling in the columns (identified by the numbers along the bottom) of the analysis table
shown in Fig. 10B. However, the generation of an analysis table is not necessary to practice the method. The analysis
table is shown to aid the reader in understanding the method.

At step 402 the four base intensities of the reference and sample sequences are adjusted by subtracting the
background or "blank" cell intensity from each base intensity. Each set of "mutation" probes has an associated "blank" cell.
Suppose that the reference "blank" cell intensity is 1 and the sample "blank" cell intensity is 2. The base intensities are
then background subtracted as follows:

Reference             Sample
A -> 12 - 1 = 11      A -> 11 - 2 = 9
C -> 9 - 1 = 8        C -> 30 - 2 = 28
G -> 80 - 1 = 79      G -> 60 - 2 = 58
T -> 15 - 1 = 14      T -> 6 - 2 = 4



Preferably, if a base intensity is then less than or equal to zero, the base intensity is set equal to a small positive number
to prevent division by zero or negative numbers in future calculations.

In the analysis table, the position of each base of interest in the reference and sample sequences is placed in column 1.
The reference wild-type base, which is C for this example, is placed in column 2 of the analysis table.

At step 404 the base intensity associated with the reference wild-type (column 2 of the analysis table) is checked
to see if it has sufficient intensity to call the unknown base. In this example, the reference wild-type is C. However, the
base intensity associated with the wild-type is the G base intensity, which is 79 in this example, because the base
intensities come from complementary "mutation" probes. The G base intensity is checked by determining if
its intensity is greater than a predetermined background difference cutoff. The background difference cutoff is a number
that specifies how far the base intensities must be above the background intensity in order to correctly call the
unknown base. Thus, the base intensity associated with the reference wild-type must be greater than the background
difference cutoff or the unknown base is not callable.

If the background difference cutoff is 5, the base intensity associated with the reference wild-type has sufficient
intensity, so a P (pass) is placed in column 3 of the analysis table as shown at step 406. Otherwise, at step 407
an F (fail) is placed in column 3 of the analysis table.

At step 408 the ratios of the base intensity associated with the reference wild-type to each of the possible bases are
calculated. The ratio of the base intensity associated with the reference wild-type to itself will be 1 and the other ratios
will usually be greater than 1. The base intensity associated with the reference wild-type is G so the following ratios are
calculated:



G:A -> 79 / 11 = 7.2
G:C -> 79 / 8 = 9.9
G:G -> 79 / 79 = 1.0
G:T -> 79 / 14 = 5.6

These ratios are placed in columns 4 through 7 of the analysis table, respectively.

At step 410 the highest base intensity associated with the sample sequence is checked to see if it has sufficient
intensity to call the unknown base. The highest base intensity is checked by determining if the intensity is greater than
the background difference cutoff. Thus, the highest base intensity must be greater than the background difference cutoff
or the unknown base is not callable.

If the background difference cutoff is 5, the highest base intensity, which is G in this example, has sufficient
intensity, so a P (pass) is placed in column 8 of the analysis table as shown at step 412. Otherwise, at step 413
an F (fail) is placed in column 8 of the analysis table.

At step 414 the ratios of the highest base intensity of the sample to each of the possible bases are calculated. The
ratio of the highest base intensity to itself will be 1 and the other ratios will usually be greater than 1. Thus, the highest
base intensity is G so the following ratios are calculated:

G:A -> 58 / 9 = 6.4
G:C -> 58 / 28 = 2.3
G:G -> 58 / 58 = 1.0
G:T -> 58 / 4 = 14.5

These ratios are placed in columns 9 through 12 of the analysis table, respectively.

At step 416 if both the reference and sample sequence probes failed to have sufficient intensity to call the unknown
base, meaning there is an 'F' in columns 3 and 8 of the analysis table, the unknown base is assigned the code N
(insufficient intensity) as shown at step 418. An 'N' is placed in column 17 of the analysis table. Additionally, a confidence
code of 9 is placed in column 18 of the analysis table where the confidence codes have the following meanings:



Code   Meaning
0      Probable reference wild-type
1      Probable mutation
2      Reference sufficient intensity, insufficient intensity in sample suggests possible mutation
3      Borderline differences, unknown base ambiguous
4      Sample sufficient intensity, insufficient intensity in reference to allow comparison
5-8    Currently unassigned
9      Insufficient intensity in reference and sample, no interpretation possible


The confidence codes are useful for indicating to the user the resulting analysis of the reference method.

At step 420 if only the reference sequence probes failed to have sufficient intensity to call the unknown base, meaning
there is an 'F' in column 3 and a 'P' in column 8 of the analysis table, the unknown base is assigned the code N (insufficient
intensity) as shown at step 422. An 'N' is placed in column 17 and a confidence code of 4 is placed in column 18 of the
analysis table.

At step 424 if only the sample sequence probes failed to have sufficient intensity to call the unknown base, meaning
there is a 'P' in column 3 and an 'F' in column 8 of the analysis table, the unknown base is assigned the code N (insufficient
intensity) as shown at step 426. An 'N' is placed in column 17 and a confidence code of 2 is placed in column 18 of the
analysis table.

In this example, both the reference and sample sequence probes have sufficient intensity to call the unknown base.
At step 428 the ratios of the reference ratios to the sample ratios for each base type are calculated. Thus, the ratio A:A
(column 4 to column 9) is placed in column 13 of the analysis table. The ratio C:C (column 5 to column 10) is placed in
column 14 of the analysis table. The ratio G:G (column 6 to column 11) is placed in column 15 of the analysis table.
Lastly, the ratio T:T (column 7 to column 12) is placed in column 16 of the analysis table. These ratios are calculated as
follows:

A:A -> 7.2 / 6.4 = 1.1
C:C -> 9.9 / 2.3 = 4.3
G:G -> 1.0 / 1.0 = 1.0
T:T -> 5.6 / 14.5 = 0.4

The unknown base is called by comparing these ratios of ratios to two predetermined values as follows.

At step 430 if all the ratios of ratios (columns 13 to 16 of the analysis table) are less than a predetermined lower
ratio cutoff, the unknown base is assigned the code of the reference wild-type as shown at step 432. Thus, the code for
the reference wild-type (as shown in column 2) would be placed in column 17 and a confidence code of 0 would be
placed in column 18 of the analysis table.

At step 434 if all the ratios of ratios are less than a predetermined upper ratio cutoff, the unknown base is assigned
an ambiguity code that indicates the unknown base may be any one of the bases that has a complementary ratio of
ratios greater than the lower ratio cutoff and less than the upper ratio cutoff as shown at step 436. Thus, if the ratios of
ratios for A:A, C:C and G:G are all greater than the lower ratio cutoff and less than the upper ratio cutoff, the unknown
base would be assigned the code B (meaning "not A"). This is because the ratios of ratios are complementary to their
respective bases as follows:

A:A -> T
C:C -> G
G:G -> C

The ambiguity code would be placed in column 17 and a confidence code of 3 would be placed in column 18 of the
analysis table.

Otherwise, at least one of the ratios of ratios is greater than the upper ratio cutoff and the unknown base is called
as the base complementary to the highest ratio of ratios. The code for the base complementary to the highest ratio of
ratios would be placed in column 17 and a confidence code of 1 would be placed in column 18 of the analysis table.

In this example, the ratios of ratios are as follows:

A:A -> 1.1
C:C -> 4.3
G:G -> 1.0
T:T -> 0.4

If the highest of these is greater than the upper ratio cutoff, the unknown base is called as the base complementary
to the highest ratio of ratios. The highest ratio of ratios is C:C, which has the complementary base G. Thus, the unknown
base is called G, which is placed in column 17, and a confidence code of 1 is placed in column 18 of the analysis table.

This example shows how the unknown base in the sample nucleic acid sequence was correctly called as base G.
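
The first implementation of the reference method (steps 402 through 436 and the final call) can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the disclosed implementation: the function names, the floor value, and the lower and upper ratio cutoffs of 1.5 and 3.0 are hypothetical, since no cutoff values are given for this example. The intensities, blank values, and background difference cutoff of 5 come from the worked example. The sketch uses unrounded values, so its C:C ratio of ratios is about 4.8 (58 / 28 is closer to 2.1 than to 2.3) rather than the rounded 4.3 printed above; the resulting call is the same.

```python
BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def subtract_background(intensities, blank, floor=1e-6):
    """Step 402: subtract the blank cell and clamp non-positive values (floor is assumed)."""
    return {b: (intensities[b] - blank if intensities[b] - blank > 0 else floor)
            for b in BASES}


def call_base_by_reference(ref, ref_blank, ref_wild_type, sample, sample_blank,
                           background_cutoff, lower_cutoff, upper_cutoff):
    """Return (base code, confidence code) for one position."""
    ref_adj = subtract_background(ref, ref_blank)
    smp_adj = subtract_background(sample, sample_blank)

    # Intensities are indexed by probe base, which is complementary to the sequence base.
    ref_top = COMPLEMENT[ref_wild_type]
    smp_top = max(BASES, key=smp_adj.get)

    # Steps 404-426: sufficient-intensity checks and the N calls with codes 9, 4, 2.
    ref_ok = ref_adj[ref_top] > background_cutoff
    smp_ok = smp_adj[smp_top] > background_cutoff
    if not ref_ok and not smp_ok:
        return "N", 9
    if not ref_ok:
        return "N", 4
    if not smp_ok:
        return "N", 2

    # Steps 408, 414, 428: reference ratios, sample ratios, then ratios of ratios.
    ratio_of_ratios = {
        b: (ref_adj[ref_top] / ref_adj[b]) / (smp_adj[smp_top] / smp_adj[b])
        for b in BASES
    }
    values = ratio_of_ratios.values()

    # Steps 430-432: everything below the lower cutoff means reference wild-type.
    if all(r < lower_cutoff for r in values):
        return ref_wild_type, 0

    # Steps 434-436: everything below the upper cutoff means an ambiguity code.
    if all(r < upper_cutoff for r in values):
        candidates = frozenset(COMPLEMENT[b] for b in BASES
                               if ratio_of_ratios[b] >= lower_cutoff)
        return IUPAC[candidates], 3

    # Otherwise call the complement of the base with the highest ratio of ratios.
    top = max(BASES, key=ratio_of_ratios.get)
    return COMPLEMENT[top], 1


reference = {"A": 12, "C": 9, "G": 80, "T": 15}
sample = {"A": 11, "C": 30, "G": 60, "T": 6}
result = call_base_by_reference(reference, 1, "C", sample, 2,
                                background_cutoff=5, lower_cutoff=1.5, upper_cutoff=3.0)
print(result)
```

With the assumed cutoffs, the worked example is called as G with confidence code 1, matching the result described above.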

Fig. 11A illustrates the high level flow of another implementation of the reference method. As in the previous
implementation, the unknown base is called by comparing the probe intensities of a reference sequence to the probe
intensities of a sample sequence. However, this implementation differs conceptually from the previous implementation
in that neighboring probe intensities are also analyzed, resulting in more accurate base calling.

As a simple example of this implementation, suppose a reference sequence and a sample sequence differ at one base
position, where the reference sequence has an A and the sample sequence has a G. Thus, there is a mutation of A to G.
Suppose further that the reference and sample sequences are tiled on chips with the reference sequence being the chip
wild-type. This implementation of the reference method will be described as identifying this mutation.

For illustration purposes, this implementation of the reference method is described as filling in a data table shown
in Fig. 11B. Although the data table contains more data than is required for this implementation, the portions of the data
table that are produced by steps in Fig. 11A are shown with the same reference numerals. The generation of a data table
is not necessary; however, it is shown to aid the reader in understanding the method.

At step 502 the base intensities of the reference and sample sequences are adjusted by subtracting the background
or "blank" cell intensity from each base intensity. Preferably, if a base intensity is then less than or equal to zero, the
base intensity is set equal to a small positive number to prevent division by zero or a negative number. In the data table,
data 502A and data 502B hold the background subtracted base intensities of the reference and sample sequences.

At step 504 the base intensity associated with the reference wild-type is checked to see if it has sufficient intensity
to call the unknown base. In this example, the reference wild-type is base A at the mutation position. The base intensity
associated with the reference wild-type is identified by a lower case "a" in the left hand column. Thus, in the data table
the reference wild-type at the mutation position has an intensity
of 385. The reference wild-type intensity of 385 is checked by determining if its intensity is greater than a predetermined
background difference cutoff, which the intensity must exceed in order to correctly call the unknown base. Thus, the base
intensity associated with the reference wild-type must be greater than the background difference cutoff or the unknown
base is not callable.

If the base intensity associated with the reference wild-type is not greater than the background difference cutoff, the
wild-type sequence would fail to have sufficient intensity as shown at step 506. Otherwise, at step 508 the wild-type
sequence would pass by having sufficient intensity.

At step 510 calculations are performed on the background subtracted base intensities of the reference sequence
in order to "normalize" the intensities. Each position in the reference sequence has four background subtracted base
intensities associated with it. The ratio of the intensity of each base to the sum of the intensities of the possible bases
(all four) is calculated, resulting in four ratios, one for each base as shown in the data table. Thus, the following ratios
would be calculated at each position in the reference sequence:

A ratio = A / (A + C + G + T)
C ratio = C / (A + C + G + T)
G ratio = G / (A + C + G + T)
T ratio = T / (A + C + G + T)

At position 241, A ratio would be the wild-type ratio. These ratios are generally calculated in order to "normalize" the
intensity data as the photon counts may vary widely from experiment to experiment. Thus, the ratios provide a way of
reconciling the intensity variations across experiments. Preferably, if the photon counts do not vary widely from
experiment to experiment, the probe intensities do not need to be "normalized."

At step 512 the highest base intensity associated with the sample sequence is checked to see if it has sufficient
intensity to call the unknown base. The intensity is checked by determining if the highest intensity sample base is greater
than the background difference cutoff. If the intensity is not greater than the background difference cutoff, the sample
sequence fails to have sufficient intensity as shown at step 514. Otherwise, at step 516 the sample sequence passes
by having sufficient intensity.

At step 518 calculations are performed on the background subtracted base intensities of the sample sequence in
order to "normalize" the intensities. Each position in the sample sequence has four background subtracted base
intensities associated with it. The ratios of the intensity of each base to the sum of the intensities of the possible bases
(all four) are calculated, resulting in four ratios, one for each base as shown in the data table.

At step 520 if either the reference or sample sequences failed to have sufficient intensity, the unknown base is
assigned the code N (insufficient intensity) as shown at step 522.

At step 524 the normalized base intensities of the reference sequence are subtracted from the normalized base
intensities of the sample sequence. Thus, at each position the following calculations are performed:

A Difference = Sample A Ratio - Reference A Ratio
C Difference = Sample C Ratio - Reference C Ratio
G Difference = Sample G Ratio - Reference G Ratio
T Difference = Sample T Ratio - Reference T Ratio

where the reference and sample ratios are calculated at steps 510 and 518, respectively. The base differences resulting
from these calculations are shown in the data table.

At step 526 each position is checked to see if there is a base difference greater than an upper difference cutoff and
a base difference lower than a lower difference cutoff. For example, Fig. 11C shows a graph of the normalized sample
base intensities minus the normalized reference base intensities. Suppose that the upper difference cutoff is 0.15 and
the lower difference cutoff is -0.15 as shown by the horizontal lines in Fig. 11C. At the mutation position (labeled with a
reference 0), the G difference is 0.28 which is greater than 0.15, the upper difference cutoff. Similarly, the A difference
is -0.32 which is less than -0.15, the lower difference cutoff. As there is a base difference above the upper difference
cutoff and a base difference below the lower difference cutoff, there may be a mutation at this position.

If there is neither a base difference above the upper difference cutoff nor a base difference below the lower difference
cutoff, the base at that position is assigned the code of the reference wild-type base as shown at step 528.
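
The normalization and difference calculations of steps 510 through 526 can be sketched as below. This is a minimal sketch, not the disclosed implementation: the function names and the intensity values are hypothetical (the data table of the figures is not reproduced here), while the cutoffs of 0.15 and -0.15 are the ones suggested above. The text does not spell out the case where only one of the two cutoffs is crossed; the sketch assumes such a position is not flagged.

```python
BASES = "ACGT"


def normalize(adjusted):
    """Steps 510/518: ratio of each background subtracted intensity to the sum of all four."""
    total = sum(adjusted[b] for b in BASES)
    return {b: adjusted[b] / total for b in BASES}


def base_differences(sample_adjusted, reference_adjusted):
    """Step 524: normalized sample ratio minus normalized reference ratio, per base."""
    sample_ratio = normalize(sample_adjusted)
    reference_ratio = normalize(reference_adjusted)
    return {b: sample_ratio[b] - reference_ratio[b] for b in BASES}


def is_candidate_mutation(differences, upper_cutoff=0.15, lower_cutoff=-0.15):
    """Step 526: one difference above the upper cutoff and one below the lower cutoff."""
    above = any(d > upper_cutoff for d in differences.values())
    below = any(d < lower_cutoff for d in differences.values())
    return above and below  # assumption: both conditions are required to flag a position


# Hypothetical background subtracted intensities at one position (not data from the figures).
reference_position = {"A": 600, "C": 100, "G": 150, "T": 150}
sample_position = {"A": 300, "C": 100, "G": 450, "T": 150}

differences = base_differences(sample_position, reference_position)
print(differences, is_candidate_mutation(differences))
```

Because each set of ratios sums to 1, a gain in one base's normalized intensity is necessarily offset by losses elsewhere, which is why a real substitution tends to produce one difference above the upper cutoff and another below the lower cutoff.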

At step 530 the ratio of the highest background subtracted base intensity in the sample to the background subtracted
reference wild-type base intensity is calculated. For example, at the mutation position 241 in the data table, the highest
background subtracted base intensity in the sample is 571 (base G). The background subtracted reference wild-type
base intensity is 385 (base A). The ratio of 571:385 is calculated and results in 1.48 as shown in the data table.

At step 532 these ratios are compared to a ratio at a neighboring position. The ratio at one position is subtracted
from the ratio at the adjacent position. For example, at the mutation position 241 in the data table, the ratio
at position 242 (which equals 1.02) is subtracted from the ratio at position 241 (which equals 1.48). It has been found
that a mutant can be confidently detected by analyzing the difference of these neighboring ratios.
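
The arithmetic of steps 530 and 532 is small enough to show directly. In this sketch the function name and the dictionary layout are assumptions; the values 571, 385, 1.48 and 1.02 are the ones quoted above for positions 241 and 242. The characteristic pattern that is subsequently looked for in these differences is not specified numerically in the text, so it is not modeled here.

```python
def neighbor_ratio_differences(ratios):
    """Step 532: for each position, subtract the ratio at the next position from its own ratio.

    ratios maps a sequence position to the step 530 ratio at that position.
    Positions without a following neighbor are skipped.
    """
    return {pos: ratios[pos] - ratios[pos + 1]
            for pos in sorted(ratios) if pos + 1 in ratios}


# Step 530 at position 241: highest sample intensity over reference wild-type intensity.
ratio_241 = 571 / 385

# Step 532 using the rounded ratios quoted in the text.
neighbor_differences = neighbor_ratio_differences({241: 1.48, 242: 1.02})
print(round(ratio_241, 2), round(neighbor_differences[241], 2))
```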

Fig. 11D shows other graphs of data in the data table. Of particular importance is the graph identified as 532, as
this is a graph of the calculations at step 532. The pattern shown in a box in graph 532 has been found to be characteristic
of a mutation. Thus, if this pattern is detected, the base is called as the base (or bases) with a normalized difference
greater than the upper difference cutoff as shown at step 536. For example, the pattern was detected and at step 526
it was shown that base G had a normalized difference of 0.28, which is greater than the upper difference cutoff of 0.15.
Therefore, the base at position 241 in the sample sequence is called a base G, which is a mutation from the reference
sequence (A to G).

If the pattern is not detected at step 534, the base at that position is assigned the code of the reference wild-type
base as shown at step 538.

This second implementation of the reference method is preferable in some instances as it takes into account probe
intensities of neighboring probes. Thus, the first implementation may not have detected the A to G mutation in this
example.

The advantage of the reference method is that the correct base can be called even in the presence of significant
levels of cross-hybridization, as long as ratios of intensities are fairly consistent from experiment to experiment. In
practice, the number of miscalls and ambiguities is significantly reduced, while the number of correct calls is actually
increased, making the reference method very useful for identifying candidate mutations. The reference method has also
been used to compare the reproducibility of experiments in terms of base calling.

IV. Statistical Method

The statistical method is a method of calling bases in a sample nucleic acid sequence. The statistical method utilizes
the statistical variation across experiments to call the bases. Therefore, the statistical method is preferable when data
from multiple experiments is available and the data is fairly consistent across the experiments. The method compares
the probe intensities of a sample sequence to statistics of probe intensities of a reference sequence in multiple
experiments.

For simplicity, the statistical method will be described as being used to identify one unknown base in a sample
nucleic acid sequence. In practice, the method is used to identify many or all the bases in a nucleic acid sequence.

The unknown base will be called by comparing the probe intensities of a sample sequence to statistics on probe
intensities of a reference sequence in multiple experiments. Generally, the probe intensities of the sample sequence
and the reference sequence experiments are from chips having the same chip wild-type. However, the reference
sequence may or may not be equal to the chip wild-type, as it may have mutations.

A base at the same position in the reference and sample sequences will be associated with up to four mutation
probes and a "blank" cell. As before, because each mutation probe is identifiable by the mutation base, the mutation
probes' intensities will be referred to as the "base intensities" of their respective mutation bases.

As a simple example of the statistical method, suppose a gene of interest (target) has the sequence
5'-AAAACTGAAAA-3' (SEQ ID NO:4). Suppose a reference sequence has the sequence 5'-AAAACCGAAAA-3' (SEQ ID NO:5),
which differs from the target sequence by the underlined base (the sixth base). Suppose further that a sample sequence
is suspected to have the same sequence as the target sequence except for a T base mutation at the underlined base
position in 5'-AAAACTGAAAA-3' (SEQ ID NO:4). Suppose that in multiple experiments the reference sequence is marked
and exposed to probes on a chip. Suppose further the sample sequence is also marked and exposed to probes on a chip.

The following are complementary "mutation" probes that could be used for a reference experiment and the sample
sequence:



Reference     Sample
3'-TGAC       3'-GACT
3'-TGCC       3'-GCCT
3'-TGGC       3'-GGCT
3'-TGTC       3'-GTCT

The "mutation" probes shown for the reference sequence may be from only one experiment; the other experiments
may have different "mutation" probes, chip wild-types, tiling methods, and the like. Although each fluorescence intensity is
from a probe, since the probes may be identified by their unique mutation bases, the probe intensities may be identified
by their respective bases as follows:

Reference          Sample
3'-TGAC -> A       3'-GACT -> A
3'-TGCC -> C       3'-GCCT -> C
3'-TGGC -> G       3'-GGCT -> G
3'-TGTC -> T       3'-GTCT -> T

Thus, base A of the reference sequence will be described as having an intensity which corresponds to the intensity of
the mutation probe with the mutation base A. The statistical method will now be described as calling the unknown base
in the sample sequence by using this example.

Fig. 12 illustrates the high level flow of the statistical method. At step 602 the four base intensities associated with
the sample sequence and each of the multiple reference experiments are adjusted by subtracting the background or
"blank" cell intensity from each base intensity. Preferably, if a base intensity is then less than or equal to zero, the base
intensity is set equal to a small positive number to prevent division by zero or negative numbers.
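By way of illustration only, the adjustment of step 602 may be sketched in Python as follows. The function name
subtract_background and the floor value of 1e-6 are assumptions made for this sketch; the description above specifies
only "a small positive number."

```python
def subtract_background(base_intensities, blank_intensity, floor=1e-6):
    """Subtract the blank-cell intensity from each base intensity (step 602).

    base_intensities maps a base letter to its measured probe intensity.
    Any result that is zero or negative is replaced by a small positive
    floor (an assumed value) so that later ratios never divide by zero.
    """
    adjusted = {}
    for base, intensity in base_intensities.items():
        value = intensity - blank_intensity
        adjusted[base] = value if value > 0 else floor
    return adjusted


# Hypothetical raw intensities with a blank-cell intensity of 10.
raw = {"A": 320, "C": 60, "G": 36, "T": 110}
adjusted = subtract_background(raw, blank_intensity=10)
```

The raw intensities and blank value here are hypothetical, chosen only to show the clamping behavior.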

At step 604 the intensities of the reference wild-type bases in the multiple experiments are checked to see if they
all have sufficient intensity to call the unknown base. The intensities are checked by determining if the intensity of the
reference wild-type base of an experiment is greater than a predetermined background difference cutoff. The wild-type
probe shown earlier for the reference sequence is 3'-TGGC, and thus the G base intensity is the wild-type base intensity.
These steps are analogous to steps in the other two methods described herein.

If the intensity of any one of the reference wild-type bases is not greater than the background difference cutoff, the
wild-type experiments fail to have sufficient intensity as shown at step 606. Otherwise, at step 608 the wild-type
experiments pass by having sufficient intensity.
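The check of steps 604 through 608 may be sketched as below. This is a minimal illustration, assuming each reference
experiment is held as a mapping from base letter to background subtracted intensity; the function name and the cutoff
values used in the example are hypothetical.

```python
def references_have_sufficient_intensity(reference_experiments, wild_type_base, cutoff):
    """Return True only if every reference experiment passes (steps 604-608).

    An experiment passes when its wild-type base intensity is strictly
    greater than the background difference cutoff. A single failing
    experiment makes the whole set fail (step 606).
    """
    return all(exp[wild_type_base] > cutoff for exp in reference_experiments)


# Hypothetical background subtracted intensities for two reference experiments.
experiments = [
    {"A": 160, "C": 30, "G": 710, "T": 100},
    {"A": 80, "C": 15, "G": 355, "T": 50},
]
passes = references_have_sufficient_intensity(experiments, "G", cutoff=50)
```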

At step 610 calculations are performed on the background subtracted base intensities of each of the reference
experiments in order to "normalize" the intensities. Each reference experiment has four background subtracted base
intensities associated with it: one wild-type and three for the other possible bases. In this example, the G base intensity
is the wild-type, the A, C, and T base intensities being the "other" intensities. The ratios of the intensity of each base to
the sum of the intensities of the possible bases (all four) are calculated, giving one wild-type ratio and three "other" ratios.
Thus, the following ratios would be calculated:

    A ratio = A / (A + C + G + T)
    C ratio = C / (A + C + G + T)
    G ratio = G / (A + C + G + T)
    T ratio = T / (A + C + G + T)

where G ratio is the wild-type ratio and A, C, and T ratios are the "other" ratios. These four ratios are calculated for each
reference experiment. Thus, if the number of reference experiments is n, there would be 4n ratios calculated. These
ratios are generally calculated in order to "normalize" the intensity data, as the photon counts may vary widely from
experiment to experiment. However, if the probe intensities do not vary widely from experiment to experiment, the probe
intensities do not need to be "normalized."
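The normalization of step 610 may be sketched as follows. The function name base_ratios and the two sets of
intensities are hypothetical; the second experiment is simply the first scaled by ten, to show that the ratios are
unaffected by the overall photon count.

```python
def base_ratios(intensities):
    """Divide each background subtracted base intensity by the sum of all four."""
    total = sum(intensities.values())
    return {base: value / total for base, value in intensities.items()}


# Two hypothetical reference experiments whose photon counts differ tenfold.
dim = {"A": 160, "C": 30, "G": 710, "T": 100}
bright = {base: 10 * value for base, value in dim.items()}

ratios_dim = base_ratios(dim)
ratios_bright = base_ratios(bright)
```

With n reference experiments, calling base_ratios once per experiment yields the 4n ratios mentioned above.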

At step 612 statistics are prepared for the ratios calculated for each of the reference experiments. As stated before,
each reference experiment will be associated with one wild-type ratio and three "other" ratios. The mean and standard
deviation are calculated for all the wild-type ratios. The mean and standard deviation are also calculated for each of the
other ratios, resulting in three other means and standard deviations, one for each of the bases that is not the wild-type
base. Therefore, the following would be calculated:

    Mean and standard deviation of A ratios
    Mean and standard deviation of C ratios
    Mean and standard deviation of G ratios
    Mean and standard deviation of T ratios

where the mean and standard deviation of the G ratios are also known as the wild-type mean and the wild-type standard
deviation, respectively. The means and standard deviations of the A, C, and T ratios are also known collectively as the
"other" means and standard deviations.

Suppose that the preceding calculations produced the following data:

    A ratios -> mean = 0.16    std. dev. = 0.003
    C ratios -> mean = 0.03    std. dev. = 0.002
    G ratios -> mean = 0.71    std. dev. = 0.050
    T ratios -> mean = 0.11    std. dev. = 0.004

In one embodiment, the steps up to and including step 612 are performed in a preprocessing stage for the multiple
wild-type experiments. The results of the preprocessing stage are stored in a file so that the reference calculations do
not have to be repeatedly calculated, improving performance.
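The statistics of step 612 may be sketched as follows. The description does not say whether the sample or the population
standard deviation is intended; this sketch assumes the sample standard deviation (statistics.stdev), which requires at
least two reference experiments. The function name and the ratio values are hypothetical.

```python
from statistics import mean, stdev


def ratio_statistics(ratio_sets):
    """Return {base: (mean, standard deviation)} across reference experiments.

    ratio_sets is a list with one {base: ratio} mapping per reference
    experiment. The sample standard deviation is an assumption of this
    sketch; the text does not specify which estimator is used.
    """
    stats = {}
    for base in "ACGT":
        values = [ratios[base] for ratios in ratio_sets]
        stats[base] = (mean(values), stdev(values))
    return stats


# Hypothetical ratios from two reference experiments.
reference_ratios = [
    {"A": 0.16, "C": 0.03, "G": 0.70, "T": 0.11},
    {"A": 0.16, "C": 0.03, "G": 0.72, "T": 0.09},
]
reference_stats = ratio_statistics(reference_ratios)
```

Because the result is a plain mapping, it could be written to a file once and reloaded for each sample, in keeping with
the preprocessing stage described above.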

At step 614 the highest base intensity associated with the sample sequence is checked to see if it has sufficient
intensity to call the unknown base. The intensity is checked by determining if the highest intensity unknown base is
greater than the background difference cutoff. If the intensity is not greater than the background difference cutoff, the
sample sequence fails to have sufficient intensity as shown at step 616. Otherwise, at step 618 the sample sequence
passes by having sufficient intensity.

At step 620 calculations are performed on the four background subtracted intensities of the sample sequence. The
ratios of the background subtracted intensity of each base to the sum of the background subtracted intensities of the
possible bases (all four) are calculated, giving four ratios, one for each base. For consistency, the ratio associated with
the reference wild-type base is called the wild-type ratio, with there being three "other" ratios. Thus, the following ratios
are calculated:

    A ratio = A / (A + C + G + T)
    C ratio = C / (A + C + G + T)
    G ratio = G / (A + C + G + T)
    T ratio = T / (A + C + G + T)

where ratio G is the wild-type ratio and ratios A, C, and T are the "other" ratios.

Suppose the background subtracted intensities associated with the sample are as follows:

    A -> 310
    C -> 50
    G -> 26
    T -> 100

Then, the corresponding ratios would be as follows:

    A ratio = 310 / (310 + 50 + 26 + 100) = 0.64
    C ratio =  50 / (310 + 50 + 26 + 100) = 0.10
    G ratio =  26 / (310 + 50 + 26 + 100) = 0.05
    T ratio = 100 / (310 + 50 + 26 + 100) = 0.21

At step 622 if either the reference experiments or the sample sequence failed to have sufficient intensity, the
unknown base is assigned the code N (insufficient intensity) as shown at step 624.

At step 626 the wild-type and "other" ratios associated with the sample sequence are compared to statistical
expressions. The statistical expressions include four predetermined standard deviation cutoffs, one associated with each
base. Thus, there is a standard deviation cutoff for each of the bases A, C, G, and T. The localized standard deviation
cutoffs allow the unknown base to be called with higher precision because each standard deviation cutoff can be set to a
different value. Suppose the standard deviation cutoffs are set as follows:

    A standard deviation cutoff -> 4
    C standard deviation cutoff -> 2
    G standard deviation cutoff -> 8
    T standard deviation cutoff -> 4

The wild-type base ratio associated with the sample is compared to a corresponding statistical expression:

    WT ratio ≥ WT mean - (WT std. dev. * WT base std. dev. cutoff)

where the WT base std. dev. cutoff is the standard deviation cutoff for the wild-type base. As the wild-type base is G,
the above comparison solves to the following:

    0.05 ≥ 0.71 - (0.050 * 8)
    0.05 ≥ 0.31

which is not a true expression (0.05 is not greater than 0.31).

Each of the "other" ratios associated with the sample is compared to a corresponding statistical expression:

    Other ratio > Other mean + (Other std. dev. * Other base std. dev. cutoff)

where the Other base std. dev. cutoff is the standard deviation cutoff for the particular "other" base. Thus, the above
comparison solves to the following three expressions:

    A -> 0.64 > 0.16 + (0.003 * 4)
         0.64 > 0.17
    C -> 0.10 > 0.03 + (0.002 * 2)
         0.10 > 0.03
    T -> 0.21 > 0.11 + (0.004 * 4)
         0.21 > 0.13

which are all true expressions.

At step 628 if only the wild-type ratio of the sample sequence was greater than the statistical expression, the unknown
base is assigned the code of the reference wild-type base as shown at step 630.

At step 632 if one or more of the "other" ratios of the sample sequence were greater than their respective statistical
expressions, the unknown base is assigned an ambiguity code that indicates the unknown base may be any one of the
complements of these bases, including the reference wild-type. In this example, the "other" ratios for A, C, and T were
all greater than their corresponding statistical expressions. Thus, the unknown base would be called the complements
of these bases, represented by the subset T, G, and A. Thus, the unknown base would be assigned the code D (meaning
"not C").

If none of the ratios are greater than their respective statistical expressions, the unknown base is assigned the code
X (insufficient discrimination) as shown at step 636.
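Steps 622 through 636 may be sketched as follows, using the example data given above. The function name call_base and
its parameters are hypothetical. Two points are interpretations made for this sketch rather than statements of the
description: the wild-type base is included in the called set only when its own comparison is true, and a call covering
all four bases is mapped to the IUPAC code N even though N is also used above for insufficient intensity.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# IUPAC nucleotide codes keyed by the set of bases each code stands for.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AC"): "M", frozenset("AG"): "R", frozenset("AT"): "W",
    frozenset("CG"): "S", frozenset("CT"): "Y", frozenset("GT"): "K",
    frozenset("ACG"): "V", frozenset("ACT"): "H", frozenset("AGT"): "D",
    frozenset("CGT"): "B", frozenset("ACGT"): "N",
}


def call_base(sample_ratios, reference_stats, cutoffs, wild_type, sufficient_intensity=True):
    """Call the unknown base from sample ratios and reference statistics."""
    if not sufficient_intensity:
        return "N"  # step 624: insufficient intensity
    passed = set()
    for base, ratio in sample_ratios.items():
        ref_mean, ref_std = reference_stats[base]
        if base == wild_type:
            ok = ratio >= ref_mean - ref_std * cutoffs[base]
        else:
            ok = ratio > ref_mean + ref_std * cutoffs[base]
        if ok:
            passed.add(base)
    if not passed:
        return "X"  # step 636: insufficient discrimination
    # Probe bases are complementary to the sequence, so report complements.
    return IUPAC[frozenset(COMPLEMENT[base] for base in passed)]


# Values taken from the worked example in the text.
reference_stats = {"A": (0.16, 0.003), "C": (0.03, 0.002), "G": (0.71, 0.050), "T": (0.11, 0.004)}
cutoffs = {"A": 4, "C": 2, "G": 8, "T": 4}
sample_ratios = {"A": 0.64, "C": 0.10, "G": 0.05, "T": 0.21}
print(call_base(sample_ratios, reference_stats, cutoffs, wild_type="G"))  # prints D
```

With the example values, the A, C, and T comparisons are true and the G comparison is false, so the complements T, G,
and A are reported as the code D, in agreement with the text.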

The statistical method provides accurate base calling because it utilizes statistical data from multiple reference 
experiments to call the unknown base. The statistical method has also been used to implement confidence estimates 
and calling of mixed sequences.

V. Pooling Processing



The present invention provides pooling processing, which is a method of processing reference and sample nucleic
acid sequences together to reduce variations across individual experiments. In the representative embodiment discussed
herein, the reference and sample nucleic acid sequences are labeled with different fluorescent markers emitting light at
different wavelengths. However, the nucleic acids may be labeled with other types of markers including distinguishable
radioactive markers.

After the reference and sample nucleic acid sequences are labeled with different color fluorescent markers, the
labeled reference and sample nucleic acid sequences are then combined and processed together. An apparatus for
detecting targets labeled with different markers is provided in U.S. Application No. 08/195,889 and is hereby incorporated
by reference for all purposes.

Fig. 13 illustrates the pooling processing of a reference and sample nucleic acid sequence. At step 702 a reference
nucleic acid sequence is marked with a fluorescent dye, such as fluorescein. At step 704 a sample nucleic acid sequence
is marked with a dye that, upon excitation, emits light of a different wavelength than that of the fluorescent dye of the
reference sequence. For example, the sample nucleic acid sequence may be marked with rhodamine. Alternatively, the
sample nucleic acid sequence may be marked by attaching biotin to the sample sequence which will subsequently bind
to streptavidin labeled with phycoerythrin. Of course, either sequence may be marked with these or other dyes or other
kinds of markers (e.g., radioactive) as long as the other sequence is marked with a marker that is distinguishable.

At step 706 the labeled reference sequence and the labeled sample sequence are combined. After this step,
processing continues in the same manner as for only one labeled sequence. At step 708 the sequences are fragmented.
The fragmented nucleic acid sequences are then hybridized on a chip containing probes as shown at step 710.

At step 712 a scanner generates image files that indicate the locations where the labeled nucleic acids bound to
the chip. There is typically some overlap between the two signals. This is corrected for prior to further analysis, i.e., after
correction, the data files correspond to "reference" and "sample." In general, the scanner generates an image file by
focusing excitation light on the hybridized chip and detecting the fluorescent light that is emitted. The marker emitting
the fluorescent light can be identified by the wavelength of the light. For example, the fluorescence peak of fluorescein
is about 530 nm while that of a typical rhodamine dye is about 580 nm.

The scanner creates an image file for the data associated with each fluorescent marker, indicating the locations
where the correspondingly labeled nucleic acid bound to the chip. Based upon an analysis of the fluorescence intensities
and locations, it becomes possible to extract information such as the monomer sequence of DNA or RNA.

Pooling processing reduces variations across individual experiments because much of the test environment is
common. Although pooling processing has been described as being used to improve the combined processing of reference
and sample nucleic acid sequences, the process may also be used for two reference sequences, two sample sequences,
or multiple sequences by utilizing multiple distinguishable markers.

Pooling processing may also be utilized with methods of the present invention of identifying mutations in a sample
nucleic acid sequence. These methods are highly accurate in identifying single mutations, locating multiple mutations,
and removing false positives for mutations, where a false positive is a base that has erroneously been identified as a
mutation. These methods utilize hybridization data from more than one base position to identify the likely position of
mutations. The interrogation position on the probes is utilized to more accurately identify likely mutations, which makes
more efficient use of base calling methods. These methods may be advantageously combined with the base calling
methods described herein to efficiently and accurately sequence a sample nucleic acid sequence.

As discussed earlier in reference to Fig. 8, the fluorescent intensities of cells near an interrogation position having
a mutation are relatively dark, which creates "dark regions" around the mutation. These lower fluorescent intensities
result because the cells at interrogation positions near a mutation do not contain probes that are perfectly complementary
to the sample sequence. Thus, the hybridization of these probes with the sample sequence is lower. The characteristics
of these "dark regions" may be utilized to identify mutations and false positives.

For example, a sample sequence and a reference sequence were labeled with different fluorescent markers, in this
case fluorescein and biotin/phycoerythrin. The sample and reference sequences are known and the sample sequence
is identical to the reference sequence except for mutations at certain known positions. The sample and reference
sequences were then processed together using the pooling processing described above and the sequences were
hybridized to a chip including wild-type probes that are perfectly complementary to the reference sequence. The chip
included 20-mer probes with the interrogation position of each probe being at the 12th base position in the probe.

Fig. 14A shows a graph of the scaled fluorescent intensities (photon counts) of the wild-type probes hybridizing with
the sample and reference sequences. Along the bottom of the graph are numbers which represent wild-type cell positions
on the chip. The photon counts of the probes in the wild-type cells are plotted on a logarithmic scale (powers of 10). As
shown, the photon counts range between 1 (representing a de minimis value) and 100,000. The photon count for the
probes in the wild-type cell numbered "45" is around 10,000.

At various wild-type cells, the photon count for the probes in the cells drops to 1 or lower. For example, the photon
counts for wild-type cells numbered 11, 24, 39, etc. are 1. The low photon counts are due to the fact that there are no
probes in these cells. The cells are left "blank" in order to minimize diffraction edges and thus, the location of these blank
cells is known. Consequently, the intermittent wild-type cells that have a photon count of 1 do not represent erroneous
data.

As shown in Fig. 14A, the scaled photon counts for the wild-type probes hybridizing with the sample and reference
sequences are almost the same except for two "bubbles." A bubble 730 has a top curve defined by the photon counts
of the wild-type probes that hybridized with the reference sequence and a bottom curve defined by the photon counts
of the wild-type probes that hybridized with the sample sequence. Following bubble 730, there is a section 732 where
the photon counts for the wild-type probes hybridizing with the sample and reference sequences are almost the same.
After section 732 is another bubble 734 which again has a top curve defined by the hybridization of the reference
sequence and the bottom curve defined by the hybridization of the sample sequence. Another partial bubble is shown
to the right of bubble 734.

Each bubble in Fig. 14A corresponds to a dark region surrounding a single mutation. Because the wild-type probes
at and surrounding a mutant position in the sample sequence contain a single base mismatch with the sample sequence,
the hybridization is relatively lower, which results in lower photon counts. Much information about the sample sequence
may be acquired by a detailed analysis of these bubble regions.

The width of the bubble indicates whether there is a false positive, a single mutation, or a multiple mutation. If there
is a single mutation, the width of the bubble should be approximately equal to the probe length. For example, Fig. 14A
was produced utilizing 20-mer probes. Accordingly, bubbles 730 and 734 are approximately 20 wild-type cells wide,
indicating that both of these bubbles were produced by single mutations. The width of the dark region resulting from a
single mutation is believed to be approximately equal to the probe length because each of the probes in this region has
a single base mismatch with the sample sequence.

If the width of the bubble is substantially less than the probe length, the bubble may represent a false positive. For
example, assume that at wild-type cell number 45 in Fig. 14A, the hybridization of the wild-type probe with the sample
sequence was very low (e.g., around 1000 photon counts). A base calling algorithm that calls the bases according to
the intensities among the cells at that position may indicate that there is a mutation at this position. However, the low
photon counts may be due to dust on the chip and not due to lower hybridization. Since the width of this bubble would
be 1, which is substantially lower than the probe length of 20, the lower photon count at wild-type cell 45 would not be
due to a mutation (i.e., there is no dark region surrounding that position).

If the width of the bubble is substantially more than the probe length, the bubble may represent multiple mutations.
In other words, the bubble may be produced by more than one overlapping dark region. The analysis of such a bubble
will be discussed in more detail in reference to Fig. 14C.

Returning to Fig. 14A, each of bubbles 730 and 734 is approximately 20 bases wide, indicating with a high degree
of certainty that each of the bubbles represents a single mutation. Furthermore, the bubbles may be analyzed to determine
the probable location of the mutations within the bubbles. As mentioned earlier, the 20-mer probes on the chip had an
interrogation position at the 12th base position in the probe. Thus, the base at the 12th base position is the base that
varies among the related WT-, A-, C-, G- and T-cells. Accordingly, the mutation should be located at the 12th position in
the bubble.

The actual mutation in bubble 730 occurs at the 12th position (from the left). Additionally, the actual mutation in
bubble 734 occurs at the 12th position (from the left). Thus, as the graph shows, there are 11 bases to the left of each
mutation and 8 bases to the right of each mutation. By utilizing the location of the interrogation position within the probes,
the present invention can help to identify the probable location of a mutation within a dark region or bubble.
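The arithmetic above may be sketched as follows. The function name and the convention of describing a bubble by its
first and last wild-type cell numbers (inclusive) are assumptions of this sketch, and the bubble bounds in the example
are hypothetical values chosen to be 20 cells wide.

```python
def candidate_mutation_cells(bubble_start, bubble_end, probe_length=20, interrogation_position=12):
    """Return the likely mutation cells nearest the left and right bubble ends.

    With a 20-mer and interrogation position 12, a mutation has 11 dark
    cells to its left and 8 to its right, so the leftmost candidate lies
    11 cells after the first dark cell and the rightmost candidate lies
    8 cells before the last dark cell.
    """
    left = bubble_start + (interrogation_position - 1)
    right = bubble_end - (probe_length - interrogation_position)
    return left, right


# A hypothetical 20-cell bubble spanning wild-type cells 4 through 23.
left, right = candidate_mutation_cells(4, 23)
```

For a bubble exactly one probe length wide, the two candidates coincide, which corresponds to a single mutation at the
12th position from the left. For a wider bubble they separate and bracket the region in which any further mutations lie.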

Additionally, because this method identifies specific locations that may have a mutation, more efficient base calling
may be achieved. For example, an analysis of bubble 730 indicates that there is likely to be a single mutation around
wild-type cell 15. Typically, most errors in base calling occur in the dark regions surrounding a mutation. Many false
positives in this dark zone can now be eliminated because they are incompatible with the bubble size (which indicates
a single mutation, for example). Also, by clearly identifying a "mismatch zone," algorithms may now be applied that factor
in the effect of a mismatch or multiple mismatches.

Additionally, the shape of the bubble may indicate what mutation has occurred. Fig. 14B shows a hypothetical graph
of the fluorescent intensities vs. cell locations for wild-type probes hybridizing with two sample sequences and one
reference sequence. A C-A mismatch will be more destabilizing to probe hybridization than a U-G mismatch. As shown,
the more destabilizing C-A mismatch results in a larger volume bubble. The shape of the bubble may be utilized to identify
the particular mutation by pattern matching bubbles stored in a library.

Fig. 14C shows a graph of the fluorescent intensities (photon counts) of the wild-type probes hybridizing with the
sample and reference sequences. A single bubble 750 is flanked on either side by regions 752 and 754 which do not
contain a mutation. The graph was produced from a chip containing 20-mer probes with an interrogation position at base
12 on the probes.

As shown, bubble 750 is 27 bases wide, indicating that the bubble was produced from the dark regions surrounding
more than one mutation, as 27 is greater than 20, the length of the probes. In addition to providing information that
there are multiple mutations, analysis of the bubble indicates the probable position of two of the mutations. Because the
interrogation position is at base 12 in the 20-mer probes, one of the mutations should be around 12 bases from the left
end of the bubble while another mutation should be around 8 bases from the right end of the bubble. And in fact, there
is a mutation at wild-type cell 62, which is 12 bases from the left of the bubble. Additionally, there is a mutation
of A to G at wild-type cell 69, which is 8 bases from the right of the bubble.

The third and last mutation within bubble 750 may be identified by performing base calling methods. Alternatively,
the mutation may be identified by pattern matching, which may identify not only the mutation or mutations but also
the specific location and type of mutation.

Fig. 15 illustrates the high level flow of one embodiment of the present invention that uses the hybridization data
from more than one base position to identify mutations in a sample nucleic acid sequence. After probe intensities from
the hybridization of wild-type probes with a sample and reference sequence are measured, the system identifies a bubble
region at step 780. Bubble regions are identified as regions where the hybridization of the wild-type probes to the sample
and reference sequences differ significantly. Additionally, the reference sequence should hybridize more strongly with the
wild-type probes since the wild-type probes will be perfectly complementary to the reference sequence.
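One way step 780 might be sketched is shown below. The description does not define "differ significantly"; this sketch
assumes a simple threshold on the ratio of reference to sample photon counts. The function name, the min_ratio value,
and the treatment of blank cells (photon count of 1 in both traces) as neither opening nor closing a bubble are all
assumptions made for illustration.

```python
def find_bubbles(reference, sample, min_ratio=3.0, blank_level=1.0):
    """Return (first_cell, last_cell) index pairs for each bubble region.

    A cell is "dark" when the reference count exceeds the sample count by
    at least min_ratio. Cells where both counts are at or below
    blank_level are treated as blank cells and skipped, so a blank cell
    inside a dark region does not split the bubble.
    """
    bubbles = []
    start = None
    last_dark = None
    for cell, (ref, smp) in enumerate(zip(reference, sample)):
        if ref <= blank_level and smp <= blank_level:
            continue  # blank cell: carries no hybridization information
        if ref / max(smp, blank_level) >= min_ratio:
            if start is None:
                start = cell
            last_dark = cell
        elif start is not None:
            bubbles.append((start, last_dark))
            start = None
    if start is not None:
        bubbles.append((start, last_dark))
    return bubbles


# Hypothetical traces: 30 cells, with the sample tenfold darker in cells 5-24
# and one blank cell (count 1 in both traces) inside the dark region.
reference = [1000.0] * 30
sample = [1000.0] * 5 + [100.0] * 20 + [1000.0] * 5
reference[10] = sample[10] = 1.0
bubbles = find_bubbles(reference, sample)
```

The ratio test also enforces the requirement that the reference hybridize more strongly than the sample, since a region
where the sample is brighter never reaches the threshold.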

At step 782, the system compares the base width of the bubble to the probe length. If the bubble width is substantially
less than the probe length, the bubble does not represent a mutation at step 784. The determination of how much less
the bubble width must be may vary according to experiment conditions.

At step 786, the system compares the base width of the bubble to the probe length to determine if they are
approximately equal. If the bubble width is approximately equal to the probe length, the bubble represents a single mutation
at step 788. Again, the determination of how close the bubble width should be to the probe length may vary according
to experiment conditions.

If the bubble width is substantially more than the probe length, the bubble represents multiple mutations at step 790.
The system performs base calling at likely locations of mutations at step 792. The likely locations of mutations are
determined by both the width of the bubble and the location of the interrogation position on the probes. Additionally, the
system may analyze the pattern of the bubble to determine the specific mutations and their positions. The base calling
method used with the present invention may be the intensity ratio method, reference method, statistical method, or any
other method.

At step 794, the system produces confidences that the mutations are identified correctly. Each confidence is
determined by how closely the experimental data matched the data expected for the mutation that was called. For example,
if the bubble width was exactly the same as the probe length and the base calling method identified a mutation at the
interrogation position in the probes, there is a very high likelihood or probability that the mutation was identified correctly.
The confidence may also be produced according to how closely the bubble pattern matched the pattern for that mutation
or mutations in the library of patterns.

Although in a preferred embodiment this method of identifying mutations in a sample nucleic acid sequence is
utilized in conjunction with pooling processing in order to reduce variations, the method may be utilized without pooling
processing. For example, the method may be utilized effectively where the variations between separate experiments are
minimized or the data is adjusted accordingly. Therefore, this method is not limited to the embodiment discussed above.

The present invention provides methods of accurately identifying single mutations, locating multiple mutations, and
removing false positives for mutations. These methods are advantageously performed with pooling processing and utilize
hybridization data from more than one base position to identify the likely position of mutations. The interrogation position
on the probes is also utilized to more accurately identify the likely position of mutations, which makes more efficient use
of base calling methods.

VI. Comparative Analysis (ViewSeq™)

The present invention provides a method of comparative analysis and visualization of multiple experiments. The
method allows the intensity ratio, reference, and statistical methods to be run on multiple datafiles simultaneously. This
permits different experimental conditions, sample preparations, and analysis parameters to be compared in terms of
their effects on sequence calling. The method also provides verification and editing functions, which are essential to
reading sequences, as well as navigation and analysis tools.

Fig. 16 illustrates the main screen and the associated pull down menus for comparative analysis and visualization
of multiple experiments (SEQ ID NO:8 and SEQ ID NO:9). The windows shown are from an appropriately programmed
Sun Workstation. However, the comparative analysis software may also be implemented on or ported to a personal
computer, including IBM PCs and compatibles, or other workstation environments. A window 802 is shown having pull
down menus for the following functions: File 804, Edit 806, View 808, Highlight 810, and Help 812.

The main section of the window is divided into a reference sequence area 814 and a sample sequence area 816.
The reference sequence area is where known sequences are displayed and is divided into a reference name subarea
818 and reference base subarea 820. The reference name subarea is shown with the filenames that contain the reference
sequences. The chip wild-type is identified by the filename with the extension ".wt#" where the # indicates a unit on the
chip. The reference base subarea contains the bases of the reference sequences. A capital C 822 is displayed to the
right of the reference sequence that is the chip wild-type for the current analysis. Although the chip wild-type sequence
has associated fluorescence intensities, the other reference sequences shown below the chip wild-type may be known
sequences that have not been tiled on the chip. These may or may not have associated fluorescence intensities. The
reference sequences other than the chip wild-type are used for sequence comparisons and may be in the form of simple
ASCII text files.

Sample sequence area 816 is where sample or unknown experimental sequences are displayed for comparison
with the reference sequences. The sample sequence area is divided into a sample name subarea 824 and sample base
subarea 826. The sample name subarea is shown with filenames that contain the sample sequences. The filename
extensions indicate the method used to call the sample sequence where ".cq#" denotes the intensity ratio method, ".rq#"
denotes the reference method, and ".sq#" denotes the statistical method (# indicates the unit on the chip). The sample
base subarea contains the bases of the sample sequences. The bases of the sample sequences are identified by the
codes previously set forth which, for the most part, conform to the IUPAC standard.
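The filename convention just described may be sketched as a small lookup. The function name is hypothetical, and the
sketch assumes that the "#" in each extension is a decimal unit number; the description says only that it indicates the
unit on the chip. The example filenames are likewise invented.

```python
import re

# Extension prefixes named in the text and the calling method each denotes.
METHOD_BY_PREFIX = {
    "cq": "intensity ratio",
    "rq": "reference",
    "sq": "statistical",
}


def parse_sample_filename(filename):
    """Return (method, unit) for a sample sequence filename, or None.

    Assumes the unit on the chip is written as decimal digits directly
    after the two-letter prefix, e.g. ".sq1".
    """
    match = re.search(r"\.(cq|rq|sq)(\d+)$", filename)
    if match is None:
        return None
    return METHOD_BY_PREFIX[match.group(1)], int(match.group(2))


# Hypothetical filenames.
parsed = parse_sample_filename("sample.sq1")
```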

Window 802 also contains a message panel 828. When the user selects a base with an input device in the reference
or sample base subarea, the base becomes highlighted and the pathname of the file containing the base is displayed
in the message panel. The base's position in the nucleic acid sequence is also displayed in the message panel.

In pull down menu File 804, the user is able to load files of experimental sequences that have been tiled and scanned
on a chip. There is a chip wild-type associated with each experimental sequence. The chip wild-type associated with
the first experimental sequence loaded is read and shown as the chip wild-type in reference sequence area 814. The
user is also able to load files of known nucleic acid sequences as reference sequences for comparison purposes. As
before, these known reference sequences may or may not have associated probe intensity data. Additionally, in this
menu the user is able to save sequences that are selected on the screen into a project file that can be loaded in at a
later time. The project file also contains any linkage of the sequences, where sequences are linked for comparison
purposes. Sequences to be saved, both reference and sample, are chosen by selecting the sequence filename with an
input device in the reference or sample name subareas.

In pull down menu Edit 806, the user is able to link together sequences in the reference and sample sequence
areas. After the user has selected one reference and one or more sample sequences, the sample sequences can be
linked to the reference sequence by selecting an entry in the pull down menu. Once the sequences are linked, a link
number 830 is displayed next to each of the sequences of related interest. Each group of linked sequences is associated
with a unique link number, so the user can easily identify which sequences are linked together. Linking sequences
permits the user to more easily compare the linked sequences. The user is also able to remove and display links from
this menu.

In pull down menu View 808, the user is able to display intensity graphs for selected bases. Once a base is selected
in the reference or sample base subareas, the user may request an intensity graph showing the hybridized probe intensities
of the selected base and a delineated neighborhood of bases near the selected base. Intensity graphs may be
displayed for one or multiple selected bases. The user is also able to prepare comment files and reports in this menu.

Fig. 17 illustrates an intensity graph window for a selected base at position 120 (SEQ ID NO:30 and SEQ ID NO:31).
The filename containing the sequence data is displayed at 904. The graph shows the intensities for each of the hybridized
probes associated with a base. Each grouping of four vertical bars on the graph, which are labeled as "a", "c", "g", and
"t" on line 908, shows the background subtracted intensities of probes having the indicated substitution base. In one
embodiment, the called bases are shown in red. The wild-type base is shown at line 908, the called base is shown at
line 910, and the base position is shown at line 912. In Fig. 17, the base selected is at position 120, as shown by arrow
914. The wild-type base at this position is T; however, the called base is M, which means the base is either A or C (amino).
The user is able to use intensity graphs to visually compare the intensities of each of the possible calls.
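
The ambiguity calls used in this description (M for A or C, R for A or G) follow the standard IUPAC nucleotide codes. As a minimal sketch, and not the program's actual code, the lookup below shows how a called code can be expanded into its candidate bases and checked against the wild-type base. The function names are hypothetical.

```python
# Standard IUPAC nucleotide ambiguity codes.
IUPAC_CODES = {
    "A": {"A"}, "C": {"C"}, "G": {"G"}, "T": {"T"},
    "R": {"A", "G"},  # purine
    "Y": {"C", "T"},  # pyrimidine
    "M": {"A", "C"},  # amino
    "K": {"G", "T"},  # keto
    "S": {"C", "G"},  # strong
    "W": {"A", "T"},  # weak
    "B": {"C", "G", "T"},
    "D": {"A", "G", "T"},
    "H": {"A", "C", "T"},
    "V": {"A", "C", "G"},
    "N": {"A", "C", "G", "T"},
}


def expand_call(code):
    """Return the set of bases that a called IUPAC code may stand for."""
    return IUPAC_CODES[code.upper()]


def code_for(bases):
    """Return the IUPAC code that denotes exactly the given bases."""
    wanted = {b.upper() for b in bases}
    for code, members in IUPAC_CODES.items():
        if members == wanted:
            return code
    raise ValueError(f"no IUPAC code for {sorted(wanted)}")


def is_consistent_with(called, wild_type_base):
    """True if the wild-type base is one of the bases the call allows."""
    return wild_type_base.upper() in expand_call(called)
```

For the base selected in Fig. 17, `expand_call("M")` gives A and C, and `is_consistent_with("M", "T")` is false, which is why the position stands out against the wild-type T.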

Fig. 18 illustrates multiple intensity graph windows for selected bases (SEQ ID NO:32, SEQ ID NO:33, SEQ ID
NO:34, and SEQ ID NO:35). There are three intensity graph windows 1002, 1004, and 1006 as shown. Each window
may be associated with a different experiment, where the sequence analyzed in the experiment may be either a reference
(if it has associated probe intensity data as in the chip wild-type) or a sample sequence. The windows are aligned and
a rectangular box 1008 shows the selected bases' position in each of the sequences (position 162 in Fig. 18). The
rectangular box aids the user in identifying the selected bases.

Referring again to Fig. 16, in pull down menu Highlight 810, the user is able to compare the sequences of references
and samples. At least four comparisons are available to the user, including the following: sample sequences to the chip
wild-type sequence, sample sequences to any reference sequences, sample sequences to any linked reference
sequences, and reference sequences to the chip wild-type sequence. For example, after the user has linked a reference
and sample sequence, the user can compare the bases in the linked sequences. Bases in the sample sequence that
are different from the reference sequence will then be indicated on the display device to the user (e.g., base is shown
in a different color). In another example, the user is able to perform a comparison that will help identify sample sequences.
After a sample is linked to multiple reference sequences, each base in the sample sequence that does not match the
wild-type sequence is checked to see if it matches one of the linked reference sequences. The bases that match a linked
reference sequence will then be indicated on the display device to the user. The user may then more easily identify the
sample sequence as being one of the reference sequences.
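
The two Highlight comparisons described above can be sketched as follows. This is an illustration under stated assumptions rather than the program's implementation: sequences are plain, already aligned strings, bases are compared by exact character match (ambiguity codes are not expanded), and the function and reference names are hypothetical.

```python
def differing_positions(sample, reference):
    """Return 0-based positions where the sample base differs from the reference base."""
    return [i for i, (s, r) in enumerate(zip(sample, reference)) if s != r]


def matches_to_linked_references(sample, wild_type, linked_references):
    """For each sample base that does not match the wild-type sequence,
    list the names of the linked reference sequences it does match.

    linked_references maps a (hypothetical) reference name to its sequence.
    Positions with no matching reference are left out of the result.
    """
    hits = {}
    for i in differing_positions(sample, wild_type):
        names = [
            name
            for name, ref in linked_references.items()
            if i < len(ref) and ref[i] == sample[i]
        ]
        if names:
            hits[i] = names
    return hits


if __name__ == "__main__":
    wild_type = "ATGTGGACAGTTGTA"
    sample = "ATGTGGATAGTTGTA"
    linked = {"reference 1": "ATGTGGATAGTTGTA", "reference 2": "ATGTGGAGAGTTGTA"}
    print(differing_positions(sample, wild_type))
    print(matches_to_linked_references(sample, wild_type, linked))
```

A display layer would then color the positions returned by the first function, and use the second result to suggest which linked reference the sample resembles.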

In pull down menu Help 812, the user is able to get information and instructions regarding the comparative analysis
program, the calling methods, and the IUPAC definitions used in the program.

Fig. 19 illustrates the intensity ratio method correctly calling a mutation in solutions with varying concentrations (SEQ
ID NO:10, SEQ ID NO:11, SEQ ID NO:12, SEQ ID NO:13, SEQ ID NO:14, SEQ ID NO:15, SEQ ID NO:16, SEQ ID
NO:17, and SEQ ID NO:18). A window 1102 is shown with a chip wild-type 1104 and a mutant sequence 1106. The
mutant sequence differs from the chip wild-type at the position indicated by the rectangular box 1108. The chip wild-type
and mutant sequences are a region of HIV Pol Gene spanning mutations occurring in AZT drug therapy.

There are seven sample sequences that are called using the intensity ratio method. The sample sequences are
actually solutions of different proportions of the chip wild-type sequence and the mutant sequence. Thus, there are
sample solutions 1110, 1112, 1114, 1116, 1118, 1120, and 1122. The solutions are 15-mer tilings across the chip wild-type
with increased percentages of the mutant sequence from 0 to 100% by weight. The following shows the proportions
of the sample solutions:



Sample Solution    Chip Wild-Type:Mutant
1110               100:0
1112               90:10
1114               75:25
1116               50:50
1118               25:75
1120               10:90
1122               0:100



For example, sample solution 1114 contains 75% chip wild-type sequence and 25% mutant sequence. 

Now referring to the bases called in rectangular box 1108 for the sample solutions, the intensity ratio method correctly
calls sample solution 1110 as having a base A as in the chip wild-type sequence. This is correct because sample solution
1110 is 100% chip wild-type sequence. The intensity ratio method also calls sample solution 1112 as having a base A
because the sample solution is 90% chip wild-type sequence.

The intensity ratio method calls the identified base in sample solutions 1114 and 1116 as being an R, which is an
IUPAC ambiguity code denoting A or G (purine). This is also a correct base call because the sample solutions have from
75% to 50% chip wild-type sequence and from 25% to 50% mutation sequence. Thus, the intensity ratio method correctly
calls the base in this transition state.

Sample solutions 1118, 1120, and 1122 are called by the intensity ratio method as having a mutation base G at the
specified location. This is a correct base call because the sample solutions primarily consist of the mutation sequence
(75%, 90%, and 100% respectively). Again, the intensity ratio method correctly called the bases.

These experiments also show that the base calling methods of the present invention may be used for solutions
of more than one nucleic acid sequence.
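
The behaviour seen in Fig. 19 can be illustrated with a small sketch of a ratio-based call. Only two points are taken from this document: a ratio of a higher probe intensity to a lower probe intensity is computed, and the base of the higher-intensity probe is called when that ratio exceeds a predetermined value of approximately 1.2. Everything else is an assumption made for illustration, namely the fallback to a two-base IUPAC code when the top two intensities are too close, the "N" result when there is no signal, the function name, and the intensity values in the usage note.

```python
# Two-base IUPAC ambiguity codes, keyed by the pair of bases they denote.
IUPAC_PAIRS = {
    frozenset("AC"): "M", frozenset("AG"): "R", frozenset("AT"): "W",
    frozenset("CG"): "S", frozenset("CT"): "Y", frozenset("GT"): "K",
}


def call_base(intensities, ratio_threshold=1.2):
    """Call a base from background-subtracted probe intensities.

    intensities maps each substitution base ("A", "C", "G", "T") to the
    intensity of the probe carrying that base at the interrogation position.
    """
    ranked = sorted(intensities.items(), key=lambda kv: kv[1], reverse=True)
    (best_base, best), (next_base, runner_up) = ranked[0], ranked[1]
    if best <= 0:
        return "N"  # assumption: no usable signal, so no call is made
    if runner_up <= 0 or best / runner_up > ratio_threshold:
        return best_base  # one probe clearly dominates
    # Assumption: top two probes are too close, so report both bases.
    return IUPAC_PAIRS[frozenset((best_base, next_base))]
```

With hypothetical intensities, `call_base({"A": 100, "C": 10, "G": 12, "T": 8})` returns "A", while `call_base({"A": 100, "C": 10, "G": 90, "T": 8})` returns "R", mirroring the mixed solutions 1114 and 1116.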

Fig. 20 illustrates the reference method correctly calling a mutant base where the intensity ratio method incorrectly
called the mutant base (SEQ ID NO:36, SEQ ID NO:37, SEQ ID NO:38, and SEQ ID NO:39). There are three intensity
graph windows 1202, 1204, and 1206 as shown. The windows are aligned and a rectangular box 1208 outlines the
bases of interest. Window 1202 shows a sample sequence called using the intensity ratio method. However, the base
in the rectangular box 1208 was incorrectly called base C, as there is actually a base A at that position. The intensity
ratio method incorrectly called the base as C because the probe intensity associated with base C is much higher than
the other probe intensities.

Window 1204 shows a reference sequence called using the intensity ratio method. As the reference sequence is 
known, it is not necessary to know the method used to call the reference sequence. However, it is important to have 
probe intensities for a reference sequence to use the reference method. The reference sequence is called a base C at 
the position indicated by the rectangular box. 

Window 1206 shows the sample sequence called using the reference method. The reference method correctly calls
the specified base as being base A. Thus, for some cases the reference method is preferable to the intensity ratio method
because it compares probe intensities of a sample sequence to probe intensities of a reference sequence.
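
A rough sketch of the ratio comparison behind the reference method is given below, following the three sets of ratios recited in the claims: first ratios of the wild-type probe intensity to each reference probe intensity, second ratios of the highest sample probe intensity to each sample probe intensity, and third ratios of the first to the second. The fallback to the wild-type base when no third ratio stands out, the `min_ratio` cutoff, the function name, and the intensity values in the usage note are assumptions for illustration only. All intensities are assumed to be positive.

```python
def call_with_reference(reference, sample, wild_type_base, min_ratio=1.2):
    """Call a base by comparing sample probe intensities with reference ones.

    reference and sample map each substitution base to a positive,
    background-subtracted probe intensity at the same position.
    Returns the called base and the third ratios used to choose it.
    """
    # First ratios: wild-type probe intensity over each reference intensity.
    first = {b: reference[wild_type_base] / reference[b] for b in reference}
    # Second ratios: highest sample intensity over each sample intensity.
    top = max(sample.values())
    second = {b: top / sample[b] for b in sample}
    # Third ratios: first ratio over second ratio, base by base.
    third = {b: first[b] / second[b] for b in first}
    best = max(third, key=third.get)
    if third[best] <= min_ratio:
        # Assumption: the sample profile resembles the reference profile,
        # so the wild-type base is kept.
        return wild_type_base, third
    return best, third
```

With hypothetical intensities `reference = {"A": 5, "C": 100, "G": 5, "T": 5}` and `sample = {"A": 60, "C": 100, "G": 5, "T": 5}`, a plain highest-intensity call on the sample gives C, whereas `call_with_reference(reference, sample, "C")` calls A, because the A probe is far brighter in the sample than the reference profile predicts. This mirrors the situation in windows 1202 and 1206.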
VII. Examples

Example 1

The intensity ratio method was used in sequence analysis of various polymorphic HIV-1 clones using a protease
chip. Single stranded DNA of a 382 nt region was used with 4 different clones (HXB2, SF2, NY5, pPol4mut18). Results
were compared to results from an ABI sequencer. The results are illustrated below:





                 ABI                  Protease Chip
                 Sense   Antisense    Sense   Antisense
No call          0       4            9       4
Ambiguous        6       14           17      8
Wrong call       2       3            3       1
TOTAL            8       21           29      13


SUMMARY
ABI (sense) - 99.5%
Chip (sense) - 98.1%
ABI (antisense) - 98.6% 
Chip (antisense) - 99.1% 
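
The summary percentages can be reproduced from the error totals if every no call, ambiguous call, and wrong call is counted as an error over 382 nt x 4 clones = 1528 bases per strand. That denominator is an inference from the figures given in this example and is not stated explicitly; the short check below is only a worked calculation under that assumption.

```python
# Assumption: each strand covers the 382 nt region for all 4 clones.
BASES_PER_STRAND = 382 * 4

# (no call, ambiguous, wrong call) counts taken from the table above.
ERROR_COUNTS = {
    "ABI (sense)": (0, 6, 2),
    "ABI (antisense)": (4, 14, 3),
    "Chip (sense)": (9, 17, 3),
    "Chip (antisense)": (4, 8, 1),
}


def accuracy_percent(no_call, ambiguous, wrong, total=BASES_PER_STRAND):
    """Percentage of bases called unambiguously and correctly."""
    errors = no_call + ambiguous + wrong
    return 100.0 * (1 - errors / total)


if __name__ == "__main__":
    for label, counts in ERROR_COUNTS.items():
        print(f"{label}: {accuracy_percent(*counts):.1f}%")
```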



Example 2

HIV protease genotyping was performed using the described chips and CallSeq™ intensity ratio calculations. Samples
were evaluated from AIDS patients before and after ddI treatment. Results were confirmed with ABI sequencing.

Fig. 21 illustrates the output of the ViewSeq™ program with four pretreatment samples and four posttreatment samples
(SEQ ID NO:22, SEQ ID NO:23, SEQ ID NO:24, SEQ ID NO:25, SEQ ID NO:26, and SEQ ID NO:27). Note the
base change at position 207 where a mutation has arisen. Even adjacent to two additional mutations (gt), the "a" mutation
has been properly detected.

The above description is illustrative and not restrictive. Many variations of the invention will become apparent to
those of skill in the art upon review of this disclosure. Merely by way of example, while the invention is illustrated with
particular reference to the evaluation of DNA (natural or unnatural), the methods can be used in the analysis from chips
with other materials synthesized thereon, such as RNA. The scope of the invention should, therefore, be determined
not with reference to the above description, but instead should be determined with reference to the appended claims
along with their full scope of equivalents.



SEQUENCE LISTING

(1) GENERAL INFORMATION: 

(i) APPLICANT: 

(A) NAME: Affymax Technologies N.V. 
(B) STREET: De Ruyderkade 62

(C) CITY: Curacao 

(E) COUNTRY: Netherlands Antilles 

(F) POSTAL CODE (ZIP) : none 

(ii) TITLE OF INVENTION: Computer-Aided Visualization and
Analysis System for Sequence Evaluation

(iii) NUMBER OF SEQUENCES: 39

(iv) COMPUTER READABLE FORM:
(A) MEDIUM TYPE: Floppy disk

(B) COMPUTER: IBM PC compatible

(C) OPERATING SYSTEM: PC-DOS/MS-DOS

(D) SOFTWARE: PatentIn Release #1.0, Version #1.25 (EPO)



(2) INFORMATION FOR SEQ ID NO: 1:

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 15 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 1:
ATGTGGACAG TTGTA



(2) INFORMATION FOR SEQ ID NO: 2:

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 15 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 2: 
ATGTGGATAG TTGTA 



(2) INFORMATION FOR SEQ ID NO: 3: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 15 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 3: 
ATGTGGAKAG TTGTA



(2) INFORMATION FOR SEQ ID NO: 4:

(i) SEQUENCE CHARACTERISTICS:

(A) LENGTH: 11 base pairs

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 4: 
AAAACTGAAA A 



(2) INFORMATION FOR SEQ ID NO: 5: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 11 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 5: 
AAAACCGAAA A



(2) INFORMATION FOR SEQ ID NO: 6:

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 17 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 6: 
AAACCCAATC CACATCA 



(2) INFORMATION FOR SEQ ID NO: 7: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 17 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 7: 
AAACCCAGTC CACATCA



(2) INFORMATION FOR SEQ ID NO: 8:

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 31 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 8:
GGGGAAGCAG ATTTGGGTAC CACCCAAGTA T 



(2) INFORMATION FOR SEQ ID NO: 9: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 31 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 9: 
GGGGAAGCAG ATTTGAAMAC CACCCAAGTA T



(2) INFORMATION FOR SEQ ID NO: 10:

(i) SEQUENCE CHARACTERISTICS:

(A) LENGTH: 59 base pairs

(B) TYPE: nucleic acid

(C) STRANDEDNESS: single

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 10: 

GCATTAGTAG AGATATGTAC AGAAATGGAA AAGGAAGGGA AAATTTCAAA 
AATTGGGCC 



(2) INFORMATION FOR SEQ ID NO: 11: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 59 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 11: 

GCATTAGTAG AAATTTGTAC AGAGATGGAA AAGGAAGGGA AAATTTCAAA 
AATTGGGCC



(2) INFORMATION FOR SEQ ID NO: 12:

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 59 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 12: 

GCATTAGTAG AGATATGGAG AGRARDGGRA AXXXAAGGGA AAATTNNNAA 
AATTGGGCC 



(2) INFORMATION FOR SEQ ID NO: 13: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 59 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 13: 

GCATTAGTAG AGATATGKAS AGRARDGGRA AXXXAAGGGA AAAKTNNNAA 
AATTGGGCC



(2) INFORMATION FOR SEQ ID NO: 14:

(i) SEQUENCE CHARACTERISTICS:

(A) LENGTH: 59 base pairs

(B) TYPE: nucleic acid

(C) STRANDEDNESS: single

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 14: 

GCATTAGTAG AGATATGKAS AGRRRDGGRA AXXXAAGGGA AAADTYNNAA 
AATTGGGCC 



(2) INFORMATION FOR SEQ ID NO: 15: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 59 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 15: 

GCATTAGTAG AGATATGTAS AGRRADGGAA AXGGAAGGGA AAATTNNNNA 
AATTGGGCC



(2) INFORMATION FOR SEQ ID NO: 16:

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 59 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 16: 

GCATTAGTAG AGATATGTAC AGRGAGGGAA AXGGAAGGGA AAATTNNNNA 
AATTGGGCC 



(2) INFORMATION FOR SEQ ID NO: 17: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 59 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 17: 

GCATTAGTAG AGATATGTAS AGRGAGGGAA AXGGAAGGGA AAATTNNNNA 
AATTGGGCC



(2) INFORMATION FOR SEQ ID NO: 18:

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 59 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 18: 

GCATTAGTAG GAGGNNNGAC AGGGRKGGAA AXXMAAGGGA AAAKTNNNAA 
AATTGGGCC 



(2) INFORMATION FOR SEQ ID NO: 19: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 160 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 19: 

TCGAGATAAT CTATGTCCTC GTCTACTATG TCATAATCTT CTTTACTTAA 
ACGGTCCTTT 

TACCTTTGGT TTTTACTATC CCCCTTAACC TCCAAAATAG TTTCATTCTG 
TCATGCTAGT 

CTATGGACAT CTTTAGACAC CTGTATTTCG ATATCCATGT



(2) INFORMATION FOR SEQ ID NO: 20:

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 160 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 20: 

NNGAGATANN NTATGTCCTC GTCYACTATG TNANNNNNNN NNNNNNNNAA 
ACGGTCCTNN 

NNNNNNNNNN NNNNNNNNNN CNNCNTAACC TCCAAAATAN NNNNNNTCTN 
NNNNANNNNT 

CTANNNGNAG NNNNAGANAR NCCNNNNNNN NNATNCATGT 



(2) INFORMATION FOR SEQ ID NO: 21: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 160 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 21: 

TCGAGATAAT CTATGTCCTC GTCTACTATG TCATAATNNN NNNNACTTAA 
ACGGTCCTTT 

TACCTTTGGT TTTTACTATC CCCCTTAACC TCCAAAATAG TTTCATTCTG 
NCATANNAGT 

CTATGNGNNG NNNTAGACAG NCCNNNNTCG ATATCCATGT



(2) INFORMATION FOR SEQ ID NO: 22:

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 160 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide)



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 22:

TCGAGATAAT CTATGTCCTC GTCTACTATG TCATAATCTT CTTTACTTAA
ACGGTCCTTT 60

TACCTTTGGT TTTTACTATC CNNCTTAACC TCCAAAATAG TTTCATTCTG
TCATACTAGT 120

CTATGGGTAG CTTTAGACCN CCGTATTTCG ATATCCATGT 160



(2) INFORMATION FOR SEQ ID NO: 23: 



(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 160 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 
(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO:23: 

TCGAGATAAT CTATGTCCTC GTCTACTATG TCATAATCTT CTTTACTTAA 
ACGGTCCTTT 60 

TACCTTTGGT TTTTACTATC CCNCTTAACC TCCAAAATAG TTTCATTCTG 
TCATACTAGT 120 

CTATGGGTAG CTTTAGACCC CCGTATTTCG ATATCCATGT 160



(2) INFORMATION FOR SEQ ID NO: 24:

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 160 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 24:

NCGGGATANT NTATGTCCTC GTCYACTATG TCANNNNNCN NNCNNNNCAA 
ACGGTCCNCC 

NNNNNCNNNN NNCNNCYANG AANCYCAACC TCCAAAATAN NNNNNNTCTN 
NNNNANNNCN 

CTNNNNNNAG NGNNAGACAC CTGTATNNNN NTATNCAYGT 



(2) INFORMATION FOR SEQ ID NO: 25: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 160 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 25: 

TCGRGATAAT CTATGTCCTC GTCTACTATG TCATAATCCN NNCNNCTCAA 
ACGGTCCTYC 

NCMACNNST CCCCTTAACC TCCAAAATAG TTTCATTCTG 

CTANNNNNAG NGTTAGACAC CTGTATTTCG ATATCCATGT 




EP 0 717 113 A2 



(2) INFORMATION FOR SEQ ID NO: 26: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 160 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 26:

TCGAGATAAT CTATGTCCTC GTCTACTATG TCATAATCCN NCCTACTCAA
ACGGTCCTTC 60

TACCTTTGGT TTTTACTATC CMCCTTAACC TCCAAAATAG TTTCATTCTG
TCATACTAGT 120

CTATGAGTAG CTTTAGACAC CTGTATTTCG ATATCCATGT 160



(2) INFORMATION FOR SEQ ID NO: 27:

(i) SEQUENCE CHARACTERISTICS:

(A) LENGTH: 160 base pairs

(B) TYPE: nucleic acid

(C) STRANDEDNESS: single

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 27: 

TCGAGATAAT CTATGTCCTC GTCTACTATG TCATAATCTT CTTTACYCAA 
ACGGTCCTXC 60 

TACCTTTGGT TTTTACTATC CCMCTTAACC TCCAAAATAG TTTCATTCTG 
TCATACTAGT 120 

CTATGAGTAG CTTTAGACAC CTGTATTTCG ATATCCATGT 160



(2) INFORMATION FOR SEQ ID NO: 28:

(i) SEQUENCE CHARACTERISTICS:

(A) LENGTH: 17 base pairs

(B) TYPE: nucleic acid

(C) STRANDEDNESS: single

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 28:
AAACCCAATC CACATCM 



(2) INFORMATION FOR SEQ ID NO: 29: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 17 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 29: 
MMACNCANNC CACANNM



(2) INFORMATION FOR SEQ ID NO: 30:

(i) SEQUENCE CHARACTERISTICS:

(A) LENGTH: 11 base pairs

(B) TYPE: nucleic acid

(C) STRANDEDNESS: single

(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA (oligonucleotide)



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 30:
TTGGGTACCA C 



(2) INFORMATION FOR SEQ ID NO: 31: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 11 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 31: 
TTGAAMACCA C



(2) INFORMATION FOR SEQ ID NO: 32:

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 11 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 32: 
ACAGAAATGG A 



(2) INFORMATION FOR SEQ ID NO: 33: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 11 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 33: 
AGAGRATDGG R



(2) INFORMATION FOR SEQ ID NO: 34:

(i) SEQUENCE CHARACTERISTICS:

(A) LENGTH: 11 base pairs

(B) TYPE: nucleic acid

(C) STRANDEDNESS: single

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 34: 
ASAGRRADGG A 



(2) INFORMATION FOR SEQ ID NO: 35: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 11 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 35: 
ACAGGGRRGG A



(2) INFORMATION FOR SEQ ID NO: 36:

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 11 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 36:
CTGGGGGGTA T 



(2) INFORMATION FOR SEQ ID NO: 37: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 11 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 37: 
CTGGCCSGTG T 



(2) INFORMATION FOR SEQ ID NO: 38: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 11 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 38: 
CTGGGCGGTA T



(2) INFORMATION FOR SEQ ID NO: 39:

(i) SEQUENCE CHARACTERISTICS:

(A) LENGTH: 11 base pairs

(B) TYPE: nucleic acid

(C) STRANDEDNESS: single

(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA (oligonucleotide)



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 39: 
CTGGCACGTG T 11 



Claims 

1. In a computer system, a method of identifying an unknown base in a sample nucleic acid sequence, said method
comprising the steps of:

inputting a plurality of probe intensities, each of said probe intensities being associated with a nucleic acid
probe;

said computer system comparing said plurality of probe intensities wherein each of said plurality of probe
intensities is substantially proportional to said associated nucleic acid probe hybridizing with at least one nucleic
acid sequence, said at least one nucleic acid sequence including said sample sequence; and

calling said unknown base according to results of said comparing step.

2. In a computer system, a method of identifying an unknown base in a sample nucleic acid sequence, said method
comprising the steps of:

inputting a plurality of probe intensities, each of said probe intensities being associated with a nucleic acid
probe;

said computer system comparing said plurality of probe intensities wherein each of said plurality of probe
intensities is substantially proportional to said associated nucleic acid probe hybridizing with said sample sequence;
and

calling said unknown base according to results of said comparing step.

3. The method of claim 2, wherein said comparing step includes the step of said computer system calculating a ratio
of a higher probe intensity to a lower probe intensity.

4. The method of claim 3, wherein said calling step includes the step of calling said unknown base according to said
probe associated with said higher probe intensity if said ratio is greater than a predetermined ratio value.

5. The method of claim 4, wherein said predetermined ratio value is approximately 1.2.

6. In a computer system, a method of identifying an unknown base in a sample nucleic acid sequence, said method
comprising the steps of:

inputting a first set of probe intensities, each of said probe intensities in said first set being associated with
a nucleic acid probe and substantially proportional to said associated nucleic acid probe hybridizing with a reference
nucleic acid sequence;

inputting a second set of probe intensities, each of said probe intensities in said second set being associated
with a nucleic acid probe and substantially proportional to said associated nucleic acid probe hybridizing with said
sample sequence;

said computer system comparing at least one of said probe intensities in said first set and at least one of
said probe intensities in said second set; and

calling said unknown base according to results of said comparing step.

7. The method of claim 6, wherein said comparing step includes the steps of:

calculating first ratios of a wild-type probe intensity to each probe intensity of a probe hybridizing with said
reference sequence, wherein said wild-type probe intensity is associated with a wild-type probe; and

calculating second ratios of the highest probe intensity of a probe hybridizing with said sample sequence to
each probe intensity of a probe hybridizing with said sample sequence.

8. The method of claim 7, wherein said comparing step further includes the step of calculating third ratios of said first
ratios to said second ratios.

9. The method of claim 8, wherein said calling step includes the step of calling said unknown base according to said
probe associated with a highest third ratio.

10. The method of claim 6, wherein said comparing step includes the step of calculating a ratio of a highest probe
intensity in said first set to a highest intensity in said second set.

11. The method of claim 10, wherein said comparing step further includes the step of comparing said ratio of neighboring
nucleic acid probes.



12. In a computer system, a method of identifying an unknown base in a sample nucleic acid sequence, said method
comprising the steps of:

inputting statistics about a plurality of experiments, each of said experiments producing probe intensities
each being associated with a nucleic acid probe and substantially proportional to said associated nucleic acid probe
hybridizing with a reference nucleic acid sequence;

inputting a plurality of probe intensities, each of said plurality of probe intensities being associated with a
nucleic acid probe and substantially proportional to said associated nucleic acid probe hybridizing with said sample
sequence;

said computer system comparing at least one of said plurality of probe intensities with said statistics; and

calling said unknown base according to results of said comparing step.

13. The method of claim 12, further comprising the step of calculating said statistics.

14. The method of claim 12, wherein said statistics include a mean and standard deviation.

15. A method of processing first and second nucleic acid sequences, comprising the steps of:

providing a plurality of nucleic acid probes;

labeling said first nucleic acid sequence with a first marker;

labeling said second nucleic acid sequence with a second marker; and

hybridizing said first and second labeled nucleic acid sequences at the same time.

16. The method of claim 15, wherein said plurality of nucleic acid probes are on a chip.

17. The method of claim 15, further comprising the step of fragmenting said first and second nucleic acid sequences
at the same time.

18. The method of claim 15, further comprising the step of scanning for said first and second markers on said chip, said
first and second labeled nucleic acid sequences being on said chip.

19. The method of claim 15, wherein said first and second markers are fluorescent markers that emit light at different
wavelengths upon excitation.

20. In a computer system, a method of identifying mutations in a sample nucleic acid sequence, said method comprising
the steps of:

inputting a first set of probe intensities, each of said probe intensities in said first set being associated with
a nucleic acid probe and substantially proportional to said associated nucleic acid probe hybridizing with a reference
nucleic acid sequence;

inputting a second set of probe intensities, each of said probe intensities in said second set being associated
with a nucleic acid probe and substantially proportional to said associated nucleic acid probe hybridizing with said
sample sequence;

said computer system comparing probe intensities in said first set and probe intensities in said second set
to select hybridization regions where said probe intensities in said first set and said probe intensities in said second
set differ; and

identifying mutations according to characteristics of said selected regions.

21. The method of claim 20, wherein said selected regions are determined by comparing probe intensities of wild-type
probes.

22. The method of claim 21, wherein said wild-type probes are complementary to a portion of said reference sequence.

23. The method of claim 21, wherein said identifying step further includes the steps of:

analyzing a size of a selected region;

identifying a likely position of a mutation in said selected region according to an interrogation position of said
nucleic acid probes; and

performing base calling at said likely position.

24. In a computer system, a method of analyzing a plurality of sequences of bases, said plurality of sequences including
at least one reference sequence and at least one sample sequence, the method comprising the steps of:

displaying said at least one reference sequence in a first area on a display device; and

displaying said at least one sample sequence in a second area on said display device;

whereby a user is capable of visually comparing said plurality of sequences.

25. The method of claim 24, wherein said plurality of sequences are monomer strands of DNA or RNA.

26. The method of claim 24, wherein said at least one reference sequence includes a chip wild-type that has been tiled
on a chip.

27. The method of claim 26, wherein said chip wild-type sequence is displayed as a first sequence in said first area.

28. The method of claim 26, further comprising the step of displaying a label in said first area to identify said chip wild-type
sequence.

29. The method of claim 24, wherein said at least one sample sequence has been hybridized on a chip.

30. The method of claim 24, further comprising the step of indicating bases that differ among a plurality of user selected
sequences.

31. The method of claim 24, further comprising the steps of:

displaying a name associated with each of said at least one reference sequence in said first area; and

displaying a name associated with each of said at least one sample sequence in said second area.

32. The method of claim 24, further comprising the step of linking at least one reference sequence in said first area with
at least one sample sequence in said second area.

33. The method of claim 32, further comprising the step of indicating on said display device which sequences are linked.

34. The method of claim 24, further comprising the step of indicating bases of said at least one sample sequence that
are not equal to a corresponding base in said at least one reference sequence.

35. The method of claim 24, wherein said at least one reference sequence and said at least one sample sequence are
aligned on said display device.